Heavy-tailed high-dimensional data are commonly encountered in
various scientific fields and pose great challenges to modern statistical
analysis. A natural procedure to address this problem is to use
penalized quantile regression with weighted $L_1$-penalty, called weighted
robust Lasso (WR-Lasso), in which weights are introduced to
ameliorate the bias problem induced by the $L_1$-penalty. In the ultra-high
dimensional setting, where the dimensionality can grow exponentially
with the sample size, we investigate the model selection oracle
property and establish the asymptotic normality of the WR-Lasso. We
show that only mild conditions on the model error distribution are
needed. Our theoretical results also reveal that adaptive choice of the
weight vector is essential for the WR-Lasso to enjoy these nice
asymptotic properties. To make the WR-Lasso practically feasible, we
propose a two-step procedure, called adaptive robust Lasso (AR-Lasso),
in which the weight vector in the second step is constructed based
on the $L_1$-penalized quantile regression estimate from the first step.
This two-step procedure is justified theoretically to possess the oracle
property and the asymptotic normality. Numerical studies
demonstrate the favorable finite-sample performance of the AR-Lasso.



1. Introduction. The advent of modern technology makes it easier to 
collect massive, large-scale data sets. A common feature of these data sets is 
that the number of covariates greatly exceeds the number of observations, a 
regime opposite to conventional statistical settings. For example, portfolio 
allocation with hundreds of stocks in finance involves a covariance matrix of 
about tens of thousands of parameters, but the sample sizes are often only on
the order of hundreds (e.g., daily data over a one-year period (Fan et al., 2008)).
Genome-wide association studies in biology involve hundreds of thousands
of single-nucleotide polymorphisms (SNPs), but the available sample size
is usually in the hundreds too. Data sets with a large number of variables but
relatively small sample size pose great, unprecedented challenges and
opportunities for statistical analysis.

Regularization methods have been widely used for high-dimensional vari- 
able selection (Bickel and Li, 2006; Bickel et al., 2009; Efron et al., 2007; 
Fan and Li, 2001; Lv and Fan, 2009; Tibshirani, 1996; Zhang, 2010; Zou, 
2006). Yet, most existing methods such as penalized least-squares or penal- 
ized likelihood (Fan and Lv, 2011) are designed for light-tailed distributions. 
Zhao and Yu (2006) established the irrepresentable conditions for the model 
selection consistency of the Lasso estimator. Fan and Li (2001) studied the 
oracle properties of nonconcave penalized likelihood estimators for fixed di- 
mensionality. Lv and Fan (2009) investigated the penalized least-squares es- 
timator with folded-concave penalty functions in the ultra-high dimensional 
setting and established a nonasymptotic weak oracle property. Fan and Lv 
(2008) proposed and investigated the sure independence screening method 
in the setting of light-tailed distributions. The robustness of the
aforementioned methods has not yet been thoroughly studied and well understood.

Robust regularization methods such as the least absolute deviation (LAD) 
regression and quantile regression have been used for variable selection 
in the case of fixed dimensionality. See, for example, Li and Zhu (2008); 
Wang, Li and Jiang (2007); Wu and Liu (2009); Zou and Yuan (2008). The 
penalized composite likelihood method was proposed in Bradic et al. (2011) 
for robust estimation in ultra-high dimensions with focus on the efficiency of 
the method. They still assumed sub-Gaussian tails. Belloni and Chernozhukov 
(2011) studied the $L_1$-penalized quantile regression in high-dimensional sparse
models where the dimensionality could be larger than the sample size. We
refer to their method as robust Lasso (R-Lasso). They showed that the
R-Lasso estimate is consistent at the near-oracle rate, gave conditions
under which the selected model includes the true model, and derived bounds
on the size of the selected model, uniformly in a compact set of quantile
indices. Wang (2012) studied the $L_1$-penalized LAD regression and showed
that the estimate achieves near oracle risk performance with a nearly
universal penalty parameter, and also established a sure screening property for
such an estimator. van de Geer and Müller (2012) obtained bounds on the
prediction error of a large class of $L_1$-penalized estimators, including quantile
regression. Wang et al. (2012) considered the nonconvex penalized quantile 
regression in the ultra-high dimensional setting and showed that the ora- 
cle estimate belongs to the set of local minima of the nonconvex penalized 
quantile regression, under mild assumptions on the error distribution. 

In this paper, we introduce the penalized quantile regression with the
weighted $L_1$-penalty (WR-Lasso) for robust regularization, as in Bradic et al.
(2011). The weights are introduced to reduce the bias problem induced by
the $L_1$-penalty. The flexibility of the choice of the weights provides flexibility
in shrinkage estimation of the regression coefficient. WR-Lasso shares a
similar spirit to the folded-concave penalized quantile regression (Wang et al.,
2012; Zou and Li, 2008), but avoids the nonconvex optimization problem.
We establish conditions on the error distribution in order for the WR-Lasso
to successfully recover the true underlying sparse model with asymptotic
probability one. It turns out that the required condition is much weaker
than the sub-Gaussian assumption in Bradic et al. (2011). The only
condition we impose is that the density function of the error is Lipschitz
in a neighborhood of 0. This includes a large class of heavy-tailed
distributions such as the stable distributions, including the Cauchy distribution.
It also covers the double exponential distribution whose density function is
nondifferentiable at the origin.

Unfortunately, because of the penalized nature of the estimator, the
WR-Lasso estimate is biased. In order to reduce the bias, the weights in
WR-Lasso need to be chosen adaptively according to the magnitudes of the
unknown true regression coefficients, which makes the bias reduction infeasible
for practical applications.

To make the bias reduction feasible, we introduce the adaptive robust
Lasso (AR-Lasso). The AR-Lasso first runs R-Lasso to obtain an initial
estimate, and then computes the weight vector of the weighted $L_1$-penalty
according to a decreasing function of the magnitude of the initial estimate.
After that, AR-Lasso runs WR-Lasso with the computed weights. We
formally establish the model selection oracle property of AR-Lasso in the
context of Fan and Li (2001) with no assumptions made on the tail distribution
of the model error. In particular, the asymptotic normality of the AR-Lasso
is formally established.

This paper is organized as follows. First, we introduce our robust estima- 
tors in Section 2. Then, to demonstrate the advantages of our estimator, we 
show in Section 3 with a simple example that Lasso behaves sub-optimally 
when noise has heavy tails. In Section 4.1, we study the performance of the 
oracle-assisted regularization estimator. Then in Section 4.2, we show that 
when the weights are adaptively chosen, WR-Lasso has the model selection 
oracle property, and performs as well as the oracle-assisted regularization 
estimate. In Section 4.3, we prove the asymptotic normality of our proposed 
estimator. The feasible estimator, AR-Lasso, is investigated in Section 5. 
Finally, Section 6 presents the results of the simulation studies as well as a
genome-wide association study with SNPs. The proofs are relegated to the
Appendix.

2. Adaptive Robust Lasso. Consider the linear regression model

(2.1)    $y = X\beta + \varepsilon$,

where $y$ is an $n$-dimensional response vector,
$X = (x_1, \ldots, x_n)^T = (\tilde{x}_1, \ldots, \tilde{x}_p)$ is an $n \times p$
fixed design matrix with rows $x_i^T$ and columns $\tilde{x}_j$,
$\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional regression
coefficient vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$
is an $n$-dimensional error vector whose components are independently
distributed and satisfy $P(\varepsilon_i < 0) = \tau$ for some known constant
$\tau \in (0, 1)$. Under this model, $x_i^T\beta$ is the conditional
$\tau$th quantile of $y_i$ given $x_i$. We impose no conditions on the heaviness
of the tail probability or the homoscedasticity of $\varepsilon_i$. We consider
a challenging setting in which $\log p = o(n^b)$ with some constant $b > 0$.
To ensure the model identifiability and to enhance the model fitting accuracy
and interpretability, the true regression coefficient vector $\beta^*$ is
commonly assumed to be sparse with only a small proportion of nonzeros
(Fan and Li, 2001; Tibshirani, 1996). Denoting the number of nonzero elements
of the true regression coefficients by $s_n$, we allow $s_n$ to slowly diverge
with the sample size $n$ and assume that $s_n = o(n)$. To ease the presentation,
we suppress the dependence of $s_n$ on $n$ whenever there is no confusion.
Without loss of generality, we write $\beta^* = (\beta_1^{*T}, 0^T)^T$, i.e.,
only the first $s$ entries are non-vanishing. The true model is denoted by

         $\mathcal{M}_* = \mathrm{supp}(\beta^*) = \{1, \ldots, s\}$,

and its complement, $\mathcal{M}_*^c = \{s + 1, \ldots, p\}$, represents the
set of noise variables.

We consider a fixed design matrix in this paper and denote by
$S = (S_1, \ldots, S_n)^T = (\tilde{x}_1, \ldots, \tilde{x}_s)$ the submatrix of
$X$ formed by the columns whose coefficients are non-vanishing. These variables
will be referred to as the signal covariates and the rest will be called noise
covariates. The set of columns that correspond to the noise covariates is
denoted by $Q = (Q_1, \ldots, Q_n)^T = (\tilde{x}_{s+1}, \ldots, \tilde{x}_p)$.
We standardize each column of $X$ to have $L_2$-norm $\sqrt{n}$.
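
The column standardization just described can be sketched in a few lines of Python. This is only an illustration: the function name `standardize_columns` and the toy matrix are assumptions, and no centering is applied because the text mentions only rescaling.

```python
import numpy as np


def standardize_columns(X):
    """Rescale each column of X so that its Euclidean norm equals sqrt(n)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=0)          # one L2-norm per column
    if np.any(norms == 0):
        raise ValueError("cannot standardize an all-zero column")
    return X * (np.sqrt(n) / norms)            # broadcast over columns


# Hypothetical 4 x 2 design: column norms are 5 and 10 before rescaling.
X_raw = np.array([[1.0, 10.0], [2.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
X_std = standardize_columns(X_raw)
```

After this rescaling every column satisfies $n^{-1}\sum_i x_{ij}^2 = 1$, which puts all covariates on the same scale before a common penalty level is applied.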

To recover the true model and estimate $\beta^*$, we consider the following
regularization problem

(2.2)    $\min_{\beta} \Big\{ \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n \sum_{j=1}^p p_{\lambda_n}(|\beta_j|) \Big\}$,

where $\rho_\tau(u) = u(\tau - 1\{u < 0\})$ is the quantile loss function, and
$p_{\lambda_n}(\cdot)$ is a nonnegative penalty function on $[0, \infty)$ with a
regularization parameter $\lambda_n > 0$. The quantile loss function is used in
(2.2) to overcome the difficulty of heavy tails of the error distribution.
Since $P(\varepsilon < 0) = \tau$, (2.2) can be interpreted as the sparse
estimation of the conditional $\tau$th quantile. Regarding the choice of
$p_{\lambda_n}(\cdot)$, it was demonstrated in Lv and Fan (2009) and
Fan and Lv (2011) that folded-concave penalties are more advantageous for
variable selection in high dimensions than the convex ones such as the
$L_1$-penalty. It is, however, computationally more challenging to minimize the
objective function in (2.2) when $p_{\lambda_n}(\cdot)$ is folded-concave. Note
that with a good initial estimate
$\hat\beta^{ini} = (\hat\beta_1^{ini}, \ldots, \hat\beta_p^{ini})^T$ of the true
coefficient vector, a first-order expansion of the penalty gives

         $p_{\lambda_n}(|\beta_j|) \approx p_{\lambda_n}(|\hat\beta_j^{ini}|) + p'_{\lambda_n}(|\hat\beta_j^{ini}|)\,(|\beta_j| - |\hat\beta_j^{ini}|)$.

Thus, instead of (2.2) we consider the following weighted $L_1$-regularized
quantile regression

(2.3)    $L_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n \|d \circ \beta\|_1$,

where $d = (d_1, \ldots, d_p)^T$ is the vector of non-negative weights, and
$\circ$ is the Hadamard product, i.e., the componentwise product of two vectors.
This motivates us to define the weighted robust Lasso (WR-Lasso) estimate as
the global minimizer of the convex function $L_n(\beta)$ for a given
non-stochastic weight vector:

(2.4)    $\hat\beta = \arg\min_{\beta} L_n(\beta)$.

The uniqueness of the global minimizer is easily guaranteed by adding a
negligible $L_2$-regularization in implementation. In particular, when $d_j = 1$
for all $j$, the method will be referred to as robust Lasso (R-Lasso).
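
Because both the quantile loss and the weighted $L_1$-penalty are piecewise linear, the minimization in (2.3)-(2.4) can be rewritten as a linear program. The Python sketch below illustrates that reformulation; it is not the implementation used in the paper. The function names `check_loss`, `wr_lasso_objective` and `wr_lasso`, the choice of `scipy.optimize.linprog` with the HiGHS backend, and the noiseless toy data are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linprog


def check_loss(u, tau):
    """Quantile loss rho_tau(u) = u * (tau - 1{u < 0}), applied elementwise."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))


def wr_lasso_objective(beta, X, y, tau, lam, d):
    """Objective (2.3): sum_i rho_tau(y_i - x_i^T beta) + n*lam*||d o beta||_1."""
    n = X.shape[0]
    return check_loss(y - X @ beta, tau).sum() + n * lam * np.abs(d * beta).sum()


def wr_lasso(X, y, tau, lam, d):
    """Minimize (2.3) via a linear program (an illustrative solver choice).

    Split beta = b_plus - b_minus and y - X beta = u - v with all four
    blocks nonnegative; the objective is then linear in (b_plus, b_minus, u, v).
    """
    n, p = X.shape
    d = np.asarray(d, dtype=float)
    cost = np.concatenate([n * lam * d, n * lam * d,
                           tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:2 * p]


# Toy check on noiseless data: with a small lam, the kink of the loss at
# zero residual keeps the minimizer at the true coefficient vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))
beta_star = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = X @ beta_star
beta_r = wr_lasso(X, y, tau=0.5, lam=0.01, d=np.ones(5))  # R-Lasso: d_j = 1
```

Passing a different nonnegative vector `d` gives the general WR-Lasso fit; setting `d` to all ones recovers R-Lasso as defined above. A dense equality matrix is used only for brevity and would need a sparse representation for large $n$ and $p$.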

The adaptive robust Lasso (AR-Lasso) refers specifically to the two-stage
procedure in which the stochastic weights
$\hat d_j = p'_{\lambda_n}(|\hat\beta_j^{ini}|)$ for $j = 1, \ldots, p$
are used in the second step for WR-Lasso and are constructed using a
concave penalty $p_{\lambda_n}(\cdot)$ and the initial estimates,
$\hat\beta_j^{ini}$, from the first step. In practice, we recommend using
R-Lasso as the initial estimate and then using SCAD to compute the weights in
AR-Lasso. The asymptotic result of this specific AR-Lasso is summarized in
Corollary 1 in Section 5 for the ultra-high dimensional robust regression
problem. This is a main contribution of the paper.

3. Suboptimality of Lasso. In this section, we use a specific example
to illustrate that, in the case of heavy-tailed error distribution, Lasso fails
at model selection unless the non-zero coefficients,
$\beta_1^*, \ldots, \beta_s^*$, have a very large magnitude. We assume that the
errors $\varepsilon_1, \ldots, \varepsilon_n$ have the identical symmetric
stable distribution and the characteristic function of $\varepsilon_1$ is given
by

         $E[\exp(iu\varepsilon_1)] = \exp(-|u|^\alpha)$,

where $\alpha \in (0, 2)$. By Nolan (2012), $E|\varepsilon_1|^p$ is finite for
$0 < p < \alpha$, and $E|\varepsilon_1|^p = \infty$ for $p \ge \alpha$.
Furthermore, as $z \to \infty$,

         $P(|\varepsilon_1| \ge z) \sim c_\alpha z^{-\alpha}$,

where $c_\alpha = \sin(\pi\alpha/2)\Gamma(\alpha)/\pi$ is a constant depending
only on $\alpha$, and we use the notation $\sim$ to denote that two terms are
equivalent up to some constant. Moreover, for any constant vector
$a = (a_1, \ldots, a_n)^T$, the linear combination $a^T\varepsilon$ has the
following tail behavior

(3.1)    $P(|a^T\varepsilon| > z) \sim \|a\|_\alpha^\alpha c_\alpha z^{-\alpha}$,

with $\|\cdot\|_\alpha$ denoting the $L_\alpha$-norm of a vector.
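
The tail statements above can be checked exactly in the Cauchy case $\alpha = 1$, where closed forms are available: the characteristic function $\exp(-|u|)$ is that of a standard Cauchy variable, and a linear combination $a^T\varepsilon$ of independent standard Cauchy errors is again Cauchy with scale $\|a\|_1$. The Python sketch below uses these facts; the function names and the vector `a` are hypothetical and chosen only for illustration.

```python
import math


def c_alpha(alpha):
    """Tail constant c_alpha = sin(pi*alpha/2) * Gamma(alpha) / pi."""
    return math.sin(math.pi * alpha / 2.0) * math.gamma(alpha) / math.pi


def cauchy_two_sided_tail(z, scale=1.0):
    """Exact P(|eps| > z) for a centered Cauchy variable with given scale."""
    return (2.0 / math.pi) * math.atan(scale / z)


alpha = 1.0                     # Cauchy case: E exp(i*u*eps) = exp(-|u|)
a = [1.0, -2.0, 3.0]            # hypothetical coefficient vector
norm_a = sum(abs(v) ** alpha for v in a) ** (1.0 / alpha)   # ||a||_alpha

z = 1.0e4
# Ratio of the exact tail to c_alpha * z^(-alpha) for a single error.
ratio_single = cauchy_two_sided_tail(z) / (c_alpha(alpha) * z ** (-alpha))
# Same ratio for a^T eps against ||a||_alpha^alpha * c_alpha * z^(-alpha).
ratio_combo = cauchy_two_sided_tail(z, scale=norm_a) / (
    norm_a ** alpha * c_alpha(alpha) * z ** (-alpha))
```

Both ratios approach the same constant as $z$ grows, which is what "equivalent up to some constant" means here: the linear combination inherits the $z^{-\alpha}$ decay, inflated by $\|a\|_\alpha^\alpha$.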

To demonstrate the suboptimality of Lasso, we consider a simple case in
which the design matrix satisfies the conditions that $S^TQ = 0$,
$n^{-1}S^TS = I_s$, the columns $\tilde{x}_j$ of $Q$ satisfy
$|\mathrm{supp}(\tilde{x}_j)| = m_n = O(n^{1/2})$ and
$\mathrm{supp}(\tilde{x}_k) \cap \mathrm{supp}(\tilde{x}_j) = \emptyset$ for any
$k \ne j$ and $k, j \in \{s + 1, \ldots, p\}$. Here, $m_n$ is a
positive integer measuring the sparsity level of the columns of $Q$. We assume
that there are only a fixed number of true variables, i.e., $s$ is finite, and
that $\max_{ij} |x_{ij}| = O(n^{1/4})$. Thus, it is easy to see that
$p = O(n^{1/2})$. In addition, we assume further that all nonzero regression
coefficients are the same and

         $\beta_1^* = \cdots = \beta_s^* = \beta_0 > 0$.

We first consider R-Lasso, which is the global minimizer of (2.4). We will
later see in Theorem 2 that by choosing the tuning parameter

         $\lambda_n = O\big((\log n)^2 \sqrt{(\log p)/n}\big)$,

R-Lasso can recover the true support $\mathcal{M}_* = \{1, \ldots, s\}$ with
probability tending to 1. Moreover, the sign of the true regression
coefficients can also be recovered with asymptotic probability one as long as
the following condition on signal strength is satisfied

(3.2)    $\lambda_n^{-1}\beta_0 \to \infty$, i.e., $(\log n)^{-2}\sqrt{n/(\log p)}\,\beta_0 \to \infty$.
Now, consider Lasso, which minimizes

(3.3)    $\tilde L_n(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + n\lambda_n\|\beta\|_1$.

We will see that for (3.3) to recover the true model and the correct sign
of coefficients, we need a much stronger signal level than that given in
(3.2). By results in optimization theory, the Karush-Kuhn-Tucker (KKT)
conditions, which are necessary and sufficient for $\tilde\beta$ with
$\mathcal{M} = \mathrm{supp}(\tilde\beta)$ to be a minimizer of (3.3), are

         $\tilde\beta_{\mathcal{M}} + n\lambda_n (X_{\mathcal{M}}^T X_{\mathcal{M}})^{-1}\mathrm{sgn}(\tilde\beta_{\mathcal{M}}) = (X_{\mathcal{M}}^T X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^T y$,
         $\|X_{\mathcal{M}^c}^T (y - X_{\mathcal{M}}\tilde\beta_{\mathcal{M}})\|_\infty \le n\lambda_n$,

where $\mathcal{M}^c$ is the complement of $\mathcal{M}$,
$\tilde\beta_{\mathcal{M}}$ is the subvector formed by entries of $\tilde\beta$
with indices in $\mathcal{M}$, and $X_{\mathcal{M}}$ and $X_{\mathcal{M}^c}$ are
the submatrices formed by columns of $X$ with indices in $\mathcal{M}$ and
$\mathcal{M}^c$, respectively. It is easy to see from the above two conditions
that for Lasso to enjoy the sign consistency,
$\mathrm{sgn}(\tilde\beta) = \mathrm{sgn}(\beta^*)$ with asymptotic probability
one, we must have these two conditions satisfied with
$\mathcal{M} = \mathcal{M}_*$ with probability tending to 1. Since we have
assumed that $Q^TS = 0$ and $n^{-1}S^TS = I$, the above sufficient and
necessary conditions can also be written as

(3.4)    $\tilde\beta_{\mathcal{M}_*} + \lambda_n \mathrm{sgn}(\tilde\beta_{\mathcal{M}_*}) = \beta^*_{\mathcal{M}_*} + n^{-1}S^T\varepsilon$,
(3.5)    $\|Q^T\varepsilon\|_\infty \le n\lambda_n$.

It is hard for Lasso to satisfy conditions (3.4) and (3.5) simultaneously. The
following proposition summarizes the necessary condition, whose proof is
given in Section 8.6.

Proposition 1. In the above model, with probability at least $1 - e^{-c_0}$,
where $c_0$ is some positive constant, Lasso does not have sign consistency,
unless the following signal condition is satisfied

(3.6)    $n^{\frac{3}{4} - \frac{1}{\alpha}}\beta_0 \to \infty$.

Comparing this with (3.2), it is easy to see that even in this simple case,
Lasso needs much stronger signal levels than R-Lasso in order to have
sign consistency in the presence of a heavy-tailed distribution.
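
To get a feel for the gap between the two signal conditions, one can evaluate the scales that $\beta_0$ has to dominate: $(\log n)^2\sqrt{(\log p)/n}$ for R-Lasso, reading (3.2) as a rate, and $n^{1/\alpha - 3/4}$ for Lasso, reading (3.6) the same way. The Python sketch below does this arithmetic under those readings; the function names and the chosen values of $n$ and $\alpha$ are assumptions, and all constants are dropped.

```python
import math


def rlasso_signal_scale(n, p):
    """Scale (log n)^2 * sqrt(log p / n) that beta_0 must dominate in (3.2)."""
    return math.log(n) ** 2 * math.sqrt(math.log(p) / n)


def lasso_signal_scale(n, alpha):
    """Scale n^(1/alpha - 3/4) that beta_0 must dominate in (3.6)."""
    return n ** (1.0 / alpha - 0.75)


for n in (10**4, 10**8, 10**12):
    p = int(round(math.sqrt(n)))          # p = O(n^{1/2}) as in the example
    print(n,
          rlasso_signal_scale(n, p),
          lasso_signal_scale(n, alpha=1.0),   # Cauchy errors
          lasso_signal_scale(n, alpha=1.5))
```

For Cauchy errors ($\alpha = 1$) the Lasso scale grows like $n^{1/4}$, while the R-Lasso scale shrinks to zero. These are rates rather than sharp thresholds, so the comparison is only meaningful for large $n$; for $\alpha$ closer to 2 the $(\log n)^2$ factor in (3.2) can dominate at moderate sample sizes.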

4. Model Selection Oracle Property. In this section, we establish
the model selection oracle property of WR-Lasso. The study enables us to 
see the bias due to penalization, and that an adaptive weighting scheme is 
needed in order to eliminate such a bias. We need the following condition 
on the distribution of noise. 

Condition 1. There exist uniform constants $c_1 > 0$ and $c_2 > 0$ such
that for any $u$ satisfying $|u| \le c_1$, the $f_i(u)$'s are uniformly bounded
away from 0 and $\infty$ and

         $|F_i(u) - F_i(0) - u f_i(0)| \le c_2 u^2$,

where $f_i(u)$ and $F_i(u)$ are the density function and distribution function
of the error $\varepsilon_i$, respectively.

Condition 1 implies basically that each $f_i(u)$ is Lipschitz around the
origin. Commonly used distributions such as the double-exponential
distribution and stable distributions including the Cauchy distribution all
satisfy this condition.

Denote by $H = \mathrm{diag}\{f_1(0), \ldots, f_n(0)\}$. The next condition is
on the submatrix of $X$ that corresponds to signal covariates and the magnitude
of the entries of $X$.

Condition 2. The eigenvalues of $n^{-1}S^THS$ are bounded from below and
above by some positive constants $c_0$ and $1/c_0$, respectively. Furthermore,

         $\kappa_n \equiv \max_{ij} |x_{ij}| = o(\sqrt{n}\, s^{-1})$.

Although Condition 2 is on the fixed design matrix, we note that the
condition on $\kappa_n$ above is satisfied with asymptotic probability one when
the design matrix is generated from some distributions. For instance, if the
entries of $X$ are independent copies from a sub-exponential distribution,
the bound on $\kappa_n$ is satisfied with asymptotic probability one as long as
$s = o(\sqrt{n}/(\log p))$; if the components are generated from a sub-Gaussian
distribution, then the condition on $\kappa_n$ is satisfied with probability
tending to one when $s = o(\sqrt{n/(\log p)})$.

4.1. Oracle Regularized Estimator. To evaluate our newly proposed method,
we first study how well one can do with the assistance of the oracle
information on the locations of signal covariates. Then, we use this to
establish the asymptotic property of our estimator without the oracle
assistance. Denote by $\hat\beta^o = ((\hat\beta_1^o)^T, 0^T)^T$ the oracle
regularized estimator (ORE) with $\hat\beta_1^o \in R^s$ and $0$ being the
vector of all zeros, which minimizes $L_n(\beta)$ over the space
$\{\beta = (\beta_1^T, \beta_2^T)^T \in R^p : \beta_2 = 0 \in R^{p-s}\}$.
The next theorem shows that ORE is consistent, and estimates the correct sign
of the true coefficient vector with probability tending to one. We use $d_0$
to denote the first $s$ elements of $d$.
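
Before turning to the theorem, it may help to see that Condition 1 is easy to check numerically for the two heavy-tailed examples mentioned above. The Python sketch below evaluates $|F(u) - F(0) - u f(0)|/u^2$ on a grid for the standard double-exponential and Cauchy distributions. It is a sanity check on a grid, not a proof; the function names, the choice $c_1 = 1$, and the grid size are assumptions made for illustration.

```python
import math


def laplace_cdf(u):
    """CDF of the standard double-exponential law, density 0.5*exp(-|u|)."""
    return 0.5 * math.exp(u) if u < 0 else 1.0 - 0.5 * math.exp(-u)


def cauchy_cdf(u):
    """CDF of the standard Cauchy law, density 1 / (pi * (1 + u^2))."""
    return 0.5 + math.atan(u) / math.pi


def worst_ratio(cdf, f0, c1=1.0, grid=2000):
    """Largest |F(u) - F(0) - u*f(0)| / u^2 over grid points 0 < |u| <= c1."""
    worst = 0.0
    for k in range(1, grid + 1):
        for u in (c1 * k / grid, -c1 * k / grid):
            worst = max(worst, abs(cdf(u) - cdf(0.0) - u * f0) / (u * u))
    return worst


laplace_c2 = worst_ratio(laplace_cdf, f0=0.5)            # f(0) = 1/2
cauchy_c2 = worst_ratio(cauchy_cdf, f0=1.0 / math.pi)    # f(0) = 1/pi
```

Both ratios stay bounded on $[-1, 1]$, so a finite $c_2$ exists in each case even though the double-exponential density has a kink at the origin and the Cauchy distribution has no finite mean.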

Theorem 1. Let $\gamma_n = C_1\big(\sqrt{s(\log n)/n} + \lambda_n\|d_0\|_2\big)$
with $C_1 > 0$ a constant. If Conditions 1 and 2 hold and
$\lambda_n\|d_0\|_2\sqrt{s}\,\kappa_n \to 0$, then there exists some constant
$c > 0$ such that

(4.1)    $P\big(\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n\big) \ge 1 - n^{-cs}$.

If in addition $\gamma_n^{-1}\min_{1 \le j \le s}|\beta_j^*| \to \infty$, then
with probability at least $1 - n^{-cs}$,

         $\mathrm{sgn}(\hat\beta_1^o) = \mathrm{sgn}(\beta_1^*)$,

where the above equation should be understood componentwise.

As shown in Theorem 1, the consistency rate of $\hat\beta_1^o$ in terms of the
vector $L_2$-norm is given by $\gamma_n$. The first component of $\gamma_n$,
$C_1\sqrt{s(\log n)/n}$, is the oracle rate within a factor of $\log n$, and the
second component $C_1\lambda_n\|d_0\|_2$ reflects the bias due to penalization.
If no prior information is available, one may choose equal weights
$d_0 = (1, 1, \ldots, 1)^T$, which corresponds to R-Lasso. Thus for R-Lasso,
with probability at least $1 - n^{-cs}$, it holds that

(4.2)    $\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n = C_1\big(\sqrt{s(\log n)/n} + \sqrt{s}\,\lambda_n\big)$.
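
The two components of $\gamma_n$ can be compared numerically to see how the weights on the signal coordinates drive the bias term. The Python sketch below is purely illustrative: the constant $C_1$ is unspecified in the theorem and is set to 1 as a placeholder, and the sample sizes and the tuning value `lam` are hypothetical.

```python
import math


def gamma_n(n, s, lam, d0, C1=1.0):
    """Return (oracle term, bias term) of gamma_n from Theorem 1.

    C1 is an unspecified constant in the theorem; 1.0 is a placeholder.
    """
    oracle = C1 * math.sqrt(s * math.log(n) / n)
    bias = C1 * lam * math.sqrt(sum(w * w for w in d0))   # lam * ||d0||_2
    return oracle, bias


n, s, p = 400, 4, 1000                      # hypothetical problem sizes
lam = 2.0 * math.sqrt(math.log(p) / n)      # hypothetical tuning value
oracle, bias_equal = gamma_n(n, s, lam, d0=[1.0] * s)    # R-Lasso weights
_, bias_adaptive = gamma_n(n, s, lam, d0=[0.0] * s)      # no penalty on signals
```

With equal weights the bias term equals $\sqrt{s}\,\lambda_n$ and, in this hypothetical configuration, exceeds the oracle term; weights that vanish on the signal coordinates remove it entirely, which is the motivation for choosing $d$ adaptively.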

4.2. WR-Lasso. In this section, we show that even without the oracle
information, WR-Lasso enjoys the same asymptotic property as in Theorem
1 when the weight vector is appropriately chosen. Since the regularized
estimator $\hat\beta$ in (2.4) depends on the full design matrix $X$, we need
to impose the following conditions on the design matrix to control the
correlation of columns in $Q$ and $S$.

Condition 3. With $\gamma_n$ defined in Theorem 1, it holds that

         $\big\| \frac{1}{n} Q^T H S \big\|_{2,\infty} < \frac{\lambda_n}{2\|d_1^{-1}\|_\infty \gamma_n}$,

where $\|A\|_{2,\infty} = \sup_{x \ne 0} \|Ax\|_\infty / \|x\|_2$ for a matrix
$A$ and vector $x$, and $d_1^{-1} = (d_{s+1}^{-1}, \ldots, d_p^{-1})^T$.
Furthermore, $\log(p) = o(n^b)$ for some constant $b \in (0, 1)$.




FAN ET AL. 



To understand the implications of Condition 3, we consider the case of
$f_1(0) = \cdots = f_n(0) \equiv f(0)$. In the special case of $Q^TS = 0$,
Condition 3 is satisfied automatically. In the case of equal correlation, that
is, $n^{-1}X^TX$ having off-diagonal elements all equal to $\rho$, the above
Condition 3 reduces to

         $|\rho| < \frac{\lambda_n}{4 f(0)\,\|d_1^{-1}\|_\infty \sqrt{s}\,\gamma_n}$.

This puts an upper bound on the correlation coefficient $\rho$ for such a dense
matrix.

It is well known that for Gaussian errors, the optimal choice of
regularization parameter $\lambda_n$ has the order $\sqrt{(\log p)/n}$
(Bickel et al., 2009). The distribution of the model noise with heavy tails
demands a larger choice of $\lambda_n$ to filter the noise for R-Lasso. When
$\lambda_n > \sqrt{(\log n)/n}$, $\gamma_n$ given in (4.2) is in the order of
$C_1\lambda_n\sqrt{s}$. In this case, Condition 3 reduces to

(4.3)    $\|n^{-1}Q^THS\|_{2,\infty} < O(s^{-1/2})$.

For WR-Lasso, if the weights are chosen such that
$\|d_0\|_2 = O(\sqrt{s(\log n)/n}/\lambda_n)$ and $\|d_1^{-1}\|_\infty = O(1)$,
then $\gamma_n$ is in the order of $C_1\sqrt{s(\log n)/n}$, and
correspondingly, Condition 3 becomes

         $\|n^{-1}Q^THS\|_{2,\infty} < O\big(\lambda_n\sqrt{n/(s(\log n))}\big)$.

This is a more relaxed condition than (4.3), since with heavy-tailed errors,
the optimal $\lambda_n$ should be larger than $\sqrt{(\log p)/n}$. In other
words, WR-Lasso not only reduces the bias of the estimate, but also allows for
stronger correlations among the signal and noise covariates. However, the above
choice of weights depends on unknown locations of signals. A data-driven choice
will be given in Section 5, in which the resulting AR-Lasso estimator will be
studied.

The following theorem shows the model selection oracle property of the 
WR-Lasso estimator. 

Theorem 2. Suppose Conditions 1-3 hold. In addition, assume that
$\min_{j \ge s+1} d_j > c_3$ with some constant $c_3 > 0$,

(4.4)    $\gamma_n s^{3/2}\kappa_n^2(\log_2 n)^2 = o(n\lambda_n^2)$,  $\lambda_n\|d_0\|_2\,\kappa_n \max\{\sqrt{s}, \|d_0\|_2\} \to 0$,

and $\lambda_n > 2\sqrt{(1 + c)(\log p)/n}$, where $\kappa_n$ is defined in
Condition 2, $\gamma_n$ is defined in Theorem 1, and $c$ is some positive
constant. Then, with probability at least $1 - O(n^{-cs})$, there exists a
global minimizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of
$L_n(\beta)$ which satisfies

1) $\hat\beta_2 = 0$;

2) $\|\hat\beta_1 - \beta_1^*\|_2 \le \gamma_n$.

Theorem 2 shows that the WR-Lasso estimator enjoys the same property
as ORE with probability tending to one. However, we impose non-adaptive
assumptions on the weight vector $d = (d_0^T, d_1^T)^T$. For noise covariates,
we assume $\min_{j > s} d_j > c_3$, which implies that each coordinate needs to
be penalized. For the signal covariates, we impose (4.4), which requires
$\|d_0\|_2$ to be small.

When studying the nonconvex penalized quantile regression, Wang et al.
(2012) assumed that $\kappa_n$ is bounded and the density functions of the
$\varepsilon_i$'s are uniformly bounded away from 0 and $\infty$ in a small
neighborhood of 0. Their assumption on the error distribution is weaker than
our Condition 1. We remark that the difference arises because we have weaker
conditions on $\kappa_n$ and the penalty function (see Condition 2 and (A.19)).
In fact, our Condition 1 can be weakened to the same condition as that in
Wang et al. (2012) at the cost of imposing stronger assumptions on $\kappa_n$
and the weight vector $d$.

Belloni and Chernozhukov (2011) and Wang (2012) imposed the restricted
eigenvalue assumption on the design matrix and studied the $L_1$-penalized
quantile regression and LAD regression, respectively. We impose different
conditions on the design matrix and allow flexible shrinkage by choosing $d$.
In addition, our Theorem 2 provides a stronger result of model selection
oracle property than consistency.

4.3. Asymptotic Normality. We now present the asymptotic normality of
our estimator. Define $V_n = (S^THS)^{-1/2}$ and
$Z_n = (Z_{n1}, \ldots, Z_{nn})^T = SV_n$ with $Z_{nj} \in R^s$ for
$j = 1, \ldots, n$.

Theorem 3. Assume the conditions of Theorem 2 hold, the first and
second order derivatives $f_i'(u)$ and $f_i''(u)$ are uniformly bounded in a
small neighborhood of 0 for all $i = 1, \ldots, n$, and that
$\|d_0\|_2 = O(\sqrt{s/n}/\lambda_n)$,
$\max_j \|H^{1/2}Z_{nj}\|_2 = o(s^{-7/2}(\log s)^{-1})$, and
$\sqrt{n/s}\,\min_{1 \le j \le s}|\beta_j^*| \to \infty$. Then,
with probability tending to 1 there exists a global minimizer
$\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of $L_n(\beta)$ such that
$\hat\beta_2 = 0$. Moreover,

         $c^T (Z_n^TZ_n)^{-1/2} V_n^{-1}\Big[(\hat\beta_1 - \beta_1^*) + \frac{n\lambda_n}{2} V_n^2 \tilde d_0\Big] \xrightarrow{D} N(0, \tau(1 - \tau))$,

where $c$ is an arbitrary $s$-dimensional vector satisfying $c^Tc = 1$, and
$\tilde d_0$ is an $s$-dimensional vector with the $j$th element
$d_j\,\mathrm{sgn}(\beta_j^*)$.
The proof of Theorem 3 is an extension of the proof of the asymptotic
normality theorem for the LAD estimator in Pollard (1990), in which the
theorem is proved for fixed dimensionality. The idea is to approximate
$L_n(\beta_1, 0)$ in (2.4) by a sequence of quadratic functions, whose
minimizers converge to a normal distribution. Since $L_n(\beta_1, 0)$ and the
quadratic approximation are close, their minimizers are also close, which
results in the asymptotic normality in Theorem 3.

Theorem 3 assumes that $\max_j \|H^{1/2}Z_{nj}\|_2 = o(s^{-7/2}(\log s)^{-1})$.
Since by definition $\sum_{j=1}^n \|H^{1/2}Z_{nj}\|_2^2 = s$, the maximum is at
least $\sqrt{s/n}$, and it is seen that the condition implies
$s = o(n^{1/8})$. This assumption is made to guarantee that the quadratic
approximation is close enough to $L_n(\beta_1, 0)$. When $s$ is finite, the
condition becomes $\max_j \|Z_{nj}\|_2 = o(1)$, as in Pollard (1990). Another
important assumption is $\lambda_n\sqrt{n}\,\|d_0\|_2 = O(\sqrt{s})$, which is
imposed to make sure that the bias $2^{-1}n\lambda_n c^TV_n\tilde d_0$ caused by
the penalty term (with $\tilde d_0$ the sign-adjusted weight vector of
Theorem 3) does not diverge. For instance, using R-Lasso will create a
non-diminishing bias and thus cannot be guaranteed to have asymptotic normality.

5. Properties of the Adaptive Robust Lasso. In previous sections,
we have seen that the choice of the weight vector $d$ plays a pivotal role
for the WR-Lasso estimate to enjoy the model selection oracle property
and asymptotic normality. In fact, conditions in Theorem 2 require that
$\min_{j \ge s+1} d_j > c_3$ and that $\|d_0\|_2$ does not diverge too fast.
Theorem 3 imposes an even more stringent condition,
$\|d_0\|_2 = O(\sqrt{s/n}/\lambda_n)$, on the weight vector $d_0$. For R-Lasso,
$\|d_0\|_2 = \sqrt{s}$ and these conditions become very restrictive. For
example, the condition in Theorem 3 becomes $\lambda_n = O(n^{-1/2})$, which is
too low for a thresholding level even for Gaussian errors. Hence, an adaptive
choice of weights is needed to make those conditions satisfied. To this end, we
propose a two-step procedure.

In the first step, we use R-Lasso, which gives the estimate $\hat\beta^{ini}$.
As has been shown in Belloni and Chernozhukov (2011) and Wang (2012), R-Lasso
is consistent at a near-oracle rate $\sqrt{s(\log p)/n}$ and selects the true
model $\mathcal{M}_*$ as a submodel (in other words, R-Lasso has the sure
screening property using the terminology of Fan and Lv (2008)) with asymptotic
probability one, namely,

         $\mathrm{supp}(\hat\beta^{ini}) \supseteq \mathrm{supp}(\beta^*)$  and  $\|\hat\beta_1^{ini} - \beta_1^*\|_2 = O(\sqrt{s(\log p)/n})$.

We remark that our Theorem 2 also ensures the consistency of R-Lasso. 
Compared to Belloni and Chernozhukov (2011), Theorem 2 presents stronger 
results but also needs more restrictive conditions for R-Lasso. As will be 
shown in later theorems, only the consistency of R-Lasso is needed in the



ADAPTIVE ROBUST VARIABLE SELECTION 




study of AR-Lasso, so we quote the results and conditions on R-Lasso in
Belloni and Chernozhukov (2011) with a view to imposing weaker
conditions.

In the second step, we set $\hat d = (\hat d_1, \ldots, \hat d_p)^T$ with
$\hat d_j = p'_{\lambda_n}(|\hat\beta_j^{ini}|)$, where $p_{\lambda_n}(|\cdot|)$
is a folded-concave penalty function, and then solve the regularization
problem (2.4) with the newly computed weight vector. Thus, the vector
$\hat d_0$ is expected to be close to the vector
$(p'_{\lambda_n}(|\beta_1^*|), \ldots, p'_{\lambda_n}(|\beta_s^*|))^T$ under the
$L_2$-norm. If a folded-concave penalty such as SCAD is used, then
$p'_{\lambda_n}(|\beta_j^*|)$ will be close or even equal to zero for
$1 \le j \le s$ and thus the magnitude of $\|\hat d_0\|_2$ is negligible.

Now, we formally establish the asymptotic properties of AR-Lasso. We
first present a more general result and then highlight our recommended
procedure, which uses R-Lasso as the initial estimate and then uses SCAD to
compute the stochastic weights, in Corollary 1. Denote by
$d^* = (d_1^*, \ldots, d_p^*)^T$ with $d_j^* = p'_{\lambda_n}(|\beta_j^*|)$.
Using the weight vector $\hat d$, AR-Lasso minimizes the following objective
function

(5.1)    $L_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta) + n\lambda_n\|\hat d \circ \beta\|_1$.

We also need the following conditions to show the model selection oracle
property of the two-step procedure.

Condition 4. With asymptotic probability one, the initial estimate
satisfies $\|\hat\beta^{ini} - \beta^*\|_2 \le C_2\sqrt{s(\log p)/n}$ with some
constant $C_2 > 0$.

As discussed above, if R-Lasso is used to obtain the initial estimate, it
satisfies the above condition. Our second condition is on the penalty
function.

Condition 5. $p'_{\lambda_n}(t)$ is non-increasing in $t \in (0, \infty)$ and
is Lipschitz with constant $c_5$, that is,

         $|p'_{\lambda_n}(|\beta_1|) - p'_{\lambda_n}(|\beta_2|)| \le c_5|\beta_1 - \beta_2|$,

for any $\beta_1, \beta_2 \in R$. Moreover,
$p'_{\lambda_n}(C_2\sqrt{s(\log p)/n}) > \frac{1}{2}p'_{\lambda_n}(0+)$ for
large enough $n$, where $C_2$ is defined in Condition 4.

For the SCAD (Fan and Li, 2001) penalty, $p'_{\lambda_n}(\beta)$ is given by

(5.2)    $p'_{\lambda_n}(\beta) = 1\{\beta \le \lambda_n\} + \frac{(a\lambda_n - \beta)_+}{(a - 1)\lambda_n}\,1\{\beta > \lambda_n\}$,

for a given constant $a > 2$, and it can be easily verified that Condition 5
holds if $\lambda_n > 2(a + 1)^{-1}C_2\sqrt{s(\log p)/n}$.
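
The second-step weights can be computed directly from (5.2). The Python sketch below is a minimal illustration: the function names `scad_derivative` and `ar_lasso_weights` and the first-step estimate `beta_init` are hypothetical, and the default $a = 3.7$ is a conventional choice for SCAD rather than something prescribed by the text above.

```python
def scad_derivative(beta, lam, a=3.7):
    """Normalized SCAD derivative (5.2), evaluated at beta >= 0."""
    if beta < 0:
        raise ValueError("beta must be nonnegative; pass |beta_j|")
    if a <= 2 or lam <= 0:
        raise ValueError("need a > 2 and lam > 0")
    if beta <= lam:
        return 1.0                                   # full L1 penalty
    return max(a * lam - beta, 0.0) / ((a - 1.0) * lam)   # decays to 0 at a*lam


def ar_lasso_weights(beta_init, lam, a=3.7):
    """Second-step weights d_j = p'_lam(|beta_init_j|)."""
    return [scad_derivative(abs(b), lam, a) for b in beta_init]


beta_init = [2.0, -0.9, 0.3, 0.05, 0.0]   # hypothetical first-step estimate
weights = ar_lasso_weights(beta_init, lam=0.5)
```

Coordinates whose initial estimate exceeds $a\lambda_n$ in absolute value receive weight zero and are left unpenalized in the second step, while coordinates at or below $\lambda_n$ keep the full $L_1$ penalty. The derivative equals exactly $1/2$ at $\beta = (a + 1)\lambda_n/2$, which is where the requirement $\lambda_n > 2(a + 1)^{-1}C_2\sqrt{s(\log p)/n}$ comes from.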

Theorem 4. Assume conditions of Theorem 2 hold with $d = d^*$ and
$\gamma_n = a_n$, where

         $a_n = C_3\Big(\sqrt{s(\log n)/n} + \lambda_n\big(\|d_0^*\|_2 + C_2c_5\sqrt{s(\log p)/n}\big)\Big)$,

with some constant $C_3 > 0$, and $\lambda_n s\kappa_n\sqrt{(\log p)/n} \to 0$.
Then, under Conditions 4 and 5, with probability tending to one, there exists a
global minimizer $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of (5.1) such
that $\hat\beta_2 = 0$ and $\|\hat\beta_1 - \beta_1^*\|_2 \le a_n$.

The results in Theorem 4 are analogous to those in Theorem 2. The extra
term $\lambda_n\sqrt{s(\log p)/n}$ in the convergence rate $a_n$, compared to
the convergence rate $\gamma_n$ in Theorem 2, is caused by the bias of the
initial estimate $\hat\beta^{ini}$. Since the regularization parameter
$\lambda_n$ goes to zero, the bias of AR-Lasso is much smaller than that of the
initial estimator $\hat\beta^{ini}$. Moreover, the AR-Lasso $\hat\beta$
possesses the model selection oracle property.

Now we present the asymptotic normality of the AR-Lasso estimate.

Condition 6. The smallest signal satisfies
$\min_{1 \le j \le s}|\beta_j^*| > 2C_1\sqrt{(s\log p)/n}$.
Moreover, it holds that
$p'_{\lambda_n}(|\beta|) = o(s^{-1}\lambda_n^{-1}(n\log p)^{-1/2})$ for any
$|\beta| > 2^{-1}\min_{1 \le j \le s}|\beta_j^*|$.

The above condition on the penalty function is satisfied when the SCAD
penalty is used and $\min_{1 \le j \le s}|\beta_j^*| \ge 2a\lambda_n$, where $a$
is the parameter in the SCAD penalty (5.2).

Theorem 5. Assume conditions of Theorem 3 hold with $d = d^*$ and
$\gamma_n = a_n$, where $a_n$ is defined in Theorem 4. Then, under
Conditions 4-6, with asymptotic probability one, there exists a global
minimizer $\hat\beta$ of (5.1) having the same asymptotic properties as those
in Theorem 3.

With the SCAD penalty, conditions in Theorems 4 and 5 can be simplified 
and AR-Lasso still enjoys the same asymptotic properties, as presented in 
the following corollary. 

Corollary 1. Assume $\lambda_n = O(\sqrt{s(\log p)(\log\log n)/n})$,
$\log p = o(\sqrt{n})$, $\min_{1 \le j \le s}|\beta_j^*| \ge 2a\lambda_n$ with
$a$ the parameter in the SCAD penalty, and
$\kappa_n = o(n^{1/4}s^{-1/2}(\log n)^{-3/2}(\log p)^{1/2})$. Further assume
that $\|n^{-1}Q^THS\|_{2,\infty} < C_4\sqrt{(\log p)(\log\log n)/\log n}$ with
$C_4$ some positive constant. Then, under Conditions 1 and 2, with asymptotic
probability one, there exists a global minimizer
$\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$ of (5.1) such that

         $\|\hat\beta_1 - \beta_1^*\|_2 \le O(\sqrt{s(\log n)/n})$,  $\mathrm{sgn}(\hat\beta_1) = \mathrm{sgn}(\beta_1^*)$,  and  $\hat\beta_2 = 0$.



where $c$ is an arbitrary $s$-dimensional vector satisfying $c^Tc = 1$.

Corollary 1 provides sufficient conditions for ensuring the variable selec- 
tion sign consistency of AR-Lasso. These conditions require that R-Lasso in 
the initial step has the sure screening property. We remark that in implemen- 
tation, AR-Lasso is able to select the variables missed by R-Lasso, as demon- 
strated in our numerical studies in the next section. The theoretical compari- 
son of the variable selection results of R-Lasso and AR-Lasso would be an in- 
teresting topic for future study. One set of (p, n, s, K n ) satisfying conditions in 
Corollary 1 is logp = 0(n bl ),s = o(n( 1 ~ fcl ^ 2 ) and K n = o(n fel//4 (log n)~ 3 / 2 ) 
with b\ G (0, 1/2) some constant. Corollary 1 gives one specific choice of A n , 
not necessarily the smallest A n , which makes our procedure work. In fact, 

the condition on A n can be weakened to A n > 2(a + 1)~ 1 ||/3 1 — /3i||oo- Cur- 
rently, we use the L2-norm \\f3i — /3 X 1 1 2 to bound this Loo-norm, which is 
too crude. If one can establish \\(3 1 — {3\\\oo = O p (y / n _1 logp) for an initial 

estimator (3 l , then the choice of A n can be as small as a 0{^J n~ x logp), 
the same order as that used in Wang (2012). On the other hand, since we 
are using AR-Lasso, the choice of A n is not as sensitive as R-Lasso. 

6. Numerical Studies. In this section we evaluate the finite sample 
property of our proposed estimator with simulation studies. In Section 6.1, 
we present results on synthetic data, and in Section 6.2 we perform an eQTL 
study and compare the behaviour of AR-Lasso with other state-of-the-art 
methods. 

6.1. Finite Sample Study. To assess the performance of the proposed 
estimator and compare it with other methods, we simulated data from the 
high-dimensional linear regression model 




If in addition, maxj HH 1 ' 2 Z. 



2 = o(s 7 / 2 (logs) 1 ), then we also have 



c^Z^r^V; 1 ^ - PI) A JV(0,r(l - r)) 



Vi = x f Po + £ i 



ii 



x~AA(0,S^), 
where the data had $n = 100$ observations and the number of parameters was
chosen as $p = 400$. We fixed the true regression coefficient vector as

$\beta_0 = \{2, 0, 1.5, 0, 0.80, 0, 0, 1, 0, 1.75, 0, 0, 0.75, 0, 0, 0.3, 0, \ldots, 0\}$.

For the distribution of the noise, $\varepsilon$, we considered six symmetric distributions:
Normal with variance 2 ($N(0, 2)$), a scale mixture of Normals for which
$\sigma_i^2 = 1$ with probability 0.9 and $\sigma_i^2 = 25$ otherwise ($MN_1$), a different scale
mixture model where $\varepsilon_i \sim N(0, \sigma_i^2)$ and $\sigma_i \sim \mathrm{Unif}(1, 5)$ ($MN_2$), Laplace,
Student's t with 4 degrees of freedom and doubled variance ($\sqrt{2} \times t_4$), and
Cauchy. We take $\tau = 0.5$, corresponding to L1-regression, throughout the
simulation. The correlation matrix of the covariates, $\Sigma_x$, was either chosen to be the
identity (i.e. $\Sigma_x = I_p$) or generated from an AR(1) model with correlation
0.5, that is $\Sigma_x(i, j) = 0.5^{|i-j|}$.
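
The data-generating design described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`ar1_covariates`, `draw_noise`, `simulate`) are hypothetical, the Laplace noise is taken with unit scale (the text does not state its scale), and the AR(1) rows are drawn through the stationary recursion, which yields correlation $0.5^{|i-j|}$ between coordinates; setting `rho=0` gives the identity case.

```python
import math
import random


def ar1_covariates(n, p, rho, rng):
    """Draw n rows x ~ N(0, Sigma) with Sigma[i][j] = rho ** |i - j|."""
    scale = math.sqrt(1.0 - rho * rho)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            # stationary AR(1) recursion keeps unit variance in every coordinate
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows


def draw_noise(kind, rng):
    """One draw from the named noise distribution (labels are hypothetical)."""
    if kind == "normal":      # N(0, 2)
        return rng.gauss(0.0, math.sqrt(2.0))
    if kind == "mn1":         # variance 1 w.p. 0.9, variance 25 otherwise
        return rng.gauss(0.0, 1.0 if rng.random() < 0.9 else 5.0)
    if kind == "mn2":         # sigma_i ~ Unif(1, 5)
        return rng.gauss(0.0, rng.uniform(1.0, 5.0))
    if kind == "laplace":     # difference of two unit exponentials (assumed unit scale)
        return rng.expovariate(1.0) - rng.expovariate(1.0)
    if kind == "t4":          # sqrt(2) * t_4
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(4))
        return math.sqrt(2.0) * z / math.sqrt(chi2 / 4.0)
    if kind == "cauchy":      # inverse-CDF draw of a standard Cauchy
        return math.tan(math.pi * (rng.random() - 0.5))
    raise ValueError(kind)


def simulate(n=100, p=400, kind="normal", rho=0.5, seed=0):
    """Return (X, y, beta) for one simulated data set."""
    rng = random.Random(seed)
    beta = [2, 0, 1.5, 0, 0.8, 0, 0, 1, 0, 1.75, 0, 0, 0.75, 0, 0, 0.3]
    beta = beta + [0] * (p - len(beta))
    X = ar1_covariates(n, p, rho, rng)
    y = [sum(a * b for a, b in zip(x, beta)) + draw_noise(kind, rng) for x in X]
    return X, y, beta
```

A call such as `simulate(kind="cauchy", rho=0.0)` would produce one data set for the independent-covariate Cauchy setting.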

We implemented five methods for each setting: 

1. L2-Oracle, which is the least squares estimator based on the signal
covariates.

2. Lasso, the penalized least-squares estimator with L1-penalty as in
Tibshirani (1996).

3. SCAD, the penalized least-squares estimator with SCAD penalty as 
in Fan and Li (2001). 

4. R-Lasso, the robust Lasso defined as the minimizer of (2.4) with d = 1. 

5. AR-Lasso, which is the adaptive robust lasso whose adaptive weights 
on the penalty function were computed based on the SCAD penalty 
using the R-Lasso estimate as an initial value. 
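
The weight construction in method 5 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the SCAD derivative in its usual piecewise form, normalized by $\lambda$ so that the weight equals 1 at zero, and it uses $a = 3.7$ as a default (any $a > 2$ is admissible in the text). The names `scad_derivative` and `adaptive_weights` are hypothetical.

```python
def scad_derivative(t, lam, a=3.7):
    """SCAD penalty derivative p'_lam(t) for t >= 0 (requires a > 2)."""
    if t <= lam:
        return lam
    # linearly decaying part; exactly zero once t >= a * lam
    return max(a * lam - t, 0.0) / (a - 1.0)


def adaptive_weights(beta_init, lam, a=3.7):
    """Weights d_j = p'_lam(|beta_init_j|) / lam computed from an initial estimate."""
    return [scad_derivative(abs(b), lam, a) / lam for b in beta_init]
```

Coefficients that the initial R-Lasso fit estimates as large receive weight 0 and are left unpenalized in the second step, while coefficients estimated near zero keep the full L1 weight.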

The tuning parameter, $\lambda_n$, was chosen optimally based on 100 validation
data-sets. For each of these data-sets, we ran a grid search to find the best
$\lambda_n$ (with the lowest L2 error for $\beta$) for the particular setting. This optimal
$\lambda_n$ was recorded for each of the 100 validation data-sets. The median of
these 100 optimal $\lambda_n$ values was used in the simulation studies. We preferred this
procedure over cross-validation because of the instability of the L2 loss under
heavy tails.

The following four performance measures were calculated: 

1. L2 loss, which is defined as $\|\beta^* - \hat\beta\|_2$.

2. L1 loss, which is defined as $\|\beta^* - \hat\beta\|_1$.

3. Number of noise covariates that are included in the model, that is the 
number of false positives (FP). 

4. Number of signal covariates that are not included, i.e. the number of 
false negatives (FN). 
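
The four measures listed above can be computed as in the following sketch. The function name `performance_measures` and the dictionary keys are illustrative choices, and a coefficient is treated as selected when it is exactly nonzero.

```python
import math


def performance_measures(beta_hat, beta_true):
    """L2 loss, L1 loss, false positives and false negatives of an estimate."""
    diff = [bh - bt for bh, bt in zip(beta_hat, beta_true)]
    pairs = list(zip(beta_hat, beta_true))
    return {
        "L2": math.sqrt(sum(d * d for d in diff)),
        "L1": sum(abs(d) for d in diff),
        # noise covariates (true coefficient zero) that were selected
        "FP": sum(1 for bh, bt in pairs if bh != 0 and bt == 0),
        # signal covariates (true coefficient nonzero) that were missed
        "FN": sum(1 for bh, bt in pairs if bh == 0 and bt != 0),
    }
```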






For each setting, we present the average of the performance measure based 
on 100 simulations. The results are depicted in Tables 1 and 2. A boxplot of 
the L2 losses under different noise settings is also given in Figure 1 (the L2 
loss boxplot for the independent covariate setting is similar and omitted). 
For the results in Tables 1 and 2, one should compare the performance
between Lasso and R-Lasso and that between SCAD and AR-Lasso. This
comparison reflects the effectiveness of L1-regression in dealing with heavy-tailed
distributions. Furthermore, comparing Lasso with SCAD, and R-Lasso
with AR-Lasso, shows the effectiveness of using adaptive weights in the
penalty function.

Our simulation results reveal the following facts. The quantile based estimators
were more robust in dealing with the outliers. For example, for the
first mixture model ($MN_1$) and Cauchy, R-Lasso outperformed Lasso, and
AR-Lasso outperformed SCAD in all of the four metrics, and significantly so
when the error distribution is the Cauchy distribution. On the other hand,
for light-tailed distributions such as the normal distribution, the efficiency
loss was limited. When the tails get heavier, for instance for the Laplace
distribution, quantile based methods started to outperform the least-squares
based approaches, more so when the tails got heavier.

The effectiveness of weights in AR-Lasso is self-evident. SCAD outperformed
Lasso and AR-Lasso outperformed R-Lasso in almost all of the settings.
Furthermore, for all of the error settings AR-Lasso has significantly lower
L2 and L1 loss as well as a smaller model size compared to other estimators.

It is seen that when the noise does not have heavy tails, that is for the
normal and the Laplace distribution, all the estimators are comparable in
terms of L1 loss. As expected, estimators that minimize squared loss worked
better than R-Lasso and AR-Lasso estimators under Gaussian noise, but
their performance deteriorated as the tails got heavier. In addition, in the
two heteroscedastic settings, AR-Lasso has the best performance among
all the estimators.

For Cauchy noise, least squares methods could only recover 1 or 2 of the
true variables on average. On the other hand, L1-estimators (R-Lasso and
AR-Lasso) had very few false negatives, and as evident from L2 loss values,
these estimators only missed variables with smaller magnitudes.

In addition, AR-Lasso consistently selected a smaller set of variables than
R-Lasso. For instance, for the setting with independent covariates, under the
Laplace distribution, R-Lasso and AR-Lasso had on average 34.76 and 18.81
false positives, respectively. Also note that AR-Lasso consistently outperformed
R-Lasso: it estimated $\beta^*$ (lower L1 and L2 losses) and the support
of $\beta^*$ (lower averages for the number of false positives) more efficiently.
                          L2 Oracle   Lasso         SCAD          R-Lasso       AR-Lasso
N(0,2)          L2 loss   0.833       4.114         3.412         5.342         2.662
                L1 loss   0.380       1.047         0.819         1.169         0.785
                FP, FN    -           27.00, 0.49   29.60, 0.51   36.81, 0.62   17.27, 0.70
MN_1            L2 loss   0.977       5.232         4.736         4.525         2.039
                L1 loss   0.446       1.304         1.113         1.028         0.598
                FP, FN    -           26.80, 0.73   29.29, 0.68   34.26, 0.51   16.76, 0.51
MN_2            L2 loss   1.886       7.563         7.583         8.121         5.647
                L1 loss   0.861       2.085         2.007         2.083         1.845
                FP, FN    -           20.39, 2.28   23.25, 2.19   24.64, 2.29   11.97, 2.57
Laplace         L2 loss   0.795       4.056         3.395         4.610         2.025
                L1 loss   0.366       1.016         0.799         1.039         0.573
                FP, FN    -           26.87, 0.62   29.98, 0.49   34.76, 0.48   18.81, 0.40
sqrt(2) x t_4   L2 loss   1.087       5.303         5.859         6.185         3.266
                L1 loss   0.502       1.378         1.256         1.403         0.951
                FP, FN    -           24.61, 0.85   36.95, 0.76   33.84, 0.84   18.53, 0.82
Cauchy          L2 loss   37.451      211.699       266.088       6.647         3.587
                L1 loss   17.136      30.052        40.041        1.646         1.081
                FP, FN    -           27.39, 5.78   34.32, 5.94   27.33, 1.41   17.28, 1.10

Table 1

Simulation Results with Independent Covariates





                          L2 Oracle   Lasso         SCAD          R-Lasso       AR-Lasso
N(0,2)          L2 loss   0.836       3.440         3.003         4.185         2.580
                L1 loss   0.375       0.943         0.803         1.079         0.806
                FP, FN    -           20.62, 0.59   23.13, 0.56   22.72, 0.77   14.49, 0.74
MN_1            L2 loss   1.081       4.415         3.589         3.652         1.829
                L1 loss   0.495       1.211         1.055         0.901         0.593
                FP, FN    -           18.66, 0.77   15.71, 0.75   26.65, 0.60   13.29, 0.51
MN_2            L2 loss   1.858       6.427         6.249         6.882         4.890
                L1 loss   0.844       1.899         1.876         1.916         1.785
                FP, FN    -           15.16, 2.08   14.77, 1.96   18.22, 1.91   7.86, 2.71
Laplace         L2 loss   0.803       3.341         2.909         3.606         1.785
                L1 loss   0.371       0.931         0.781         0.927         0.573
                FP, FN    -           19.32, 0.62   21.60, 0.38   24.44, 0.46   12.90, 0.55
sqrt(2) x t_4   L2 loss   1.122       4.474         4.259         4.980         2.855
                L1 loss   0.518       1.222         1.201         1.299         0.946
                FP, FN    -           20.00, 0.76   18.49, 0.91   23.56, 0.79   13.40, 1.05
Cauchy          L2 loss   31.095      217.395       243.141       5.388         3.286
                L1 loss   13.978      31.361        36.624        1.461         1.074
                FP, FN    -           25.59, 5.48   32.01, 5.43   20.80, 1.16   12.45, 1.17

Table 2

Simulation Results with Correlated Covariates



6.2. Real Data Example. We now use expression quantitative trait locus
(eQTL) mapping to illustrate the performance of R-Lasso and AR-Lasso.
eQTL studies aim at finding the variations of genotype in a certain part of
a chromosome that are associated with the gene expression levels.

Fig 1: Boxplots for L2 Loss with Correlated Covariates (panels include (d) Laplace, (e) $\sqrt{2} \times t_4$, (f) Cauchy; methods: Oracle, Lasso, SCAD, R-Lasso, AR-Lasso)

In this study, we conducted a cis-eQTL mapping for the gene CHRNA6,
cholinergic receptor, nicotinic, alpha 6. CHRNA6 is located on the 8th chromosome,
in the cytogenetic location 8p11. CHRNA6 is thought to be related
to activation of dopamine releasing neurons with nicotine (Thorgeirsson et al.,
2010). Therefore, CHRNA6 has been the subject of many nicotine addiction
studies on people with western European heritage (Saccone et al., 2009;
Thorgeirsson et al., 2010).

The data are from 90 individuals from the international 'HapMap' project
(The International HapMap Consortium, 2005), all with western European
ancestry. The data are available on ftp://ftp.sanger.ac.uk/pub/genevar/. The
normalized expression data were generated with an Illumina Sentrix Human-6
Expression Bead Chip (Stranger et al., 2007). The SNPs under investigation
are located within 1 megabase upstream and downstream of the transcription
start site (TSS) of CHRNA6; in this range, there were 554 SNPs. The additive
coding for SNPs was employed, with 0, 1 and 2 representing the major,
heterozygous, and minor populations, respectively. We further screened the
SNPs using a variation of the sure independence screening (SIS) method of Fan
and Lv (2008). We kept the top 100 SNPs that had the largest correlation with the
gene expression levels. Finally, we applied Lasso, SCAD, R-Lasso and AR-Lasso
to the screened variables. The quantile parameter, $\tau$, was set to 0.5 for
R-Lasso and AR-Lasso, corresponding to median regression. The tuning
parameter for all methods was chosen using five-fold cross validation. The
selected SNPs as well as their regression coefficients and distances from the
transcription start site are given in Table 3.
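
The screening step can be sketched as follows. The text only says that a variation of SIS was used, so this example assumes the plainest version: rank SNPs by the absolute marginal Pearson correlation with the expression level and keep the top `k`. The function names `pearson` and `screen_top_k` are hypothetical.

```python
import math


def pearson(u, v):
    """Sample Pearson correlation; returns 0.0 when either input is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0.0 or svv == 0.0:
        return 0.0
    return suv / math.sqrt(suu * svv)


def screen_top_k(columns, y, k=100):
    """Indices of the k columns most correlated (in absolute value) with y."""
    scores = [(abs(pearson(col, y)), j) for j, col in enumerate(columns)]
    # sort by decreasing score, breaking ties by column index
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])
```

Here `columns` holds one list of 0/1/2 genotype codes per SNP and `y` the expression levels; the returned indices identify the SNPs passed on to the penalized fits.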

It is seen that robust regression methods (R-Lasso and AR-Lasso) found
more of the variables to be significant. R-Lasso and AR-Lasso selected 21
and 15 variables, respectively, whereas Lasso and SCAD only found 15 and
5 of the variables to be significant. Only 4 SNPs were included in all of the
models (rs10504049, rs4466388, rs7818669, rs708190). Furthermore, none of
these SNPs were covered in the previous study (Saccone et al., 2009). We
speculate that the difference is due to the fact that the previous studies focused
on SNPs that are only 50 kb upstream and downstream. Additionally,
these studies did not consider multiple regression, which makes significant
use of the correlation structure in the data. In addition, among all the SNPs
that are chosen by these four methods, only one of them (rs10958726) appears
in the paper by Saccone et al. (2009), and only R-Lasso and AR-Lasso
found this SNP to be important.

As observed in the finite sample simulations, SCAD and AR-Lasso
consistently chose a smaller set of variables than their counterparts and are
more reliable. Furthermore, almost two thirds of the relevant SNPs lie to
the left of the transcription site.

We would also like to note that the residuals from the fitted regressions 
had very heavy right tails. This suggests that, at least for this particular 
eQTL study, it is a lot more reasonable to use methods based on quantile 
regression. The QQ-plots of the residuals from different regression methods 
are shown in Figure 2. 

Acknowledgments. The authors sincerely thank the Co-Editor, Asso- 
ciate Editor, and three referees for their constructive comments that led to 
substantial improvement of the paper. 
Table 3

Selected SNPs for the eQTL study.

SNP          Lasso     SCAD R-Lasso       AR-Lasso   Distance from TSS (in kb)
rs7823138              -0.0046                       -963
rs10090395             0.0513             0.1213     -941
rs3739368              0.0684                        -921
rs4737019              0.0331                        -889
rs7004640    -0.0170   -0.0114                       -872
rs4737023                                 -0.0124    -849
rs10504049   -0.0216   -0.0096 -0.0082    -0.0505    -800
rs11990460             -0.0918                       -769
rs6996712              0.0853             0.0139     -694
rs4466388    0.1214    0.1676 0.1603      0.1299     -681
rs4736825    0.0504                       0.1255     -653
rs7819109              0.0716                        -564
rs7012976              -0.0925                       -529
rs6474389    0.0155                                  -420
rs3136797                                 -0.0124    -381
rs12542076                                -0.0167    -247
rs13281070             0.0615                        -233
rs5024226    0.0155                                  -93
rs4305884    0.0155                                  -89
rs10958726             0.0513             0.1213     -18
rs6985527    -0.0170   -0.0114                       54
rs11995681   -0.0216   -0.0096 -0.0082    -0.0505    89
rs7818669    0.0538    0.0114 0.1056      0.1013     123
rs11775022             0.0502                        138
rs10092934   0.0155                                  468
rs7016102    0.0770    0.0856             0.0293     538
rs11776934             0.0853             0.0139     590
rs12545574   -0.0363   -0.0942            -0.0093    749
rs9298634    0.0467    0.0171             0.0753     751
rs4737107    0.0222                                  780
rs10098088             0.0363                        809



APPENDIX A: PROOFS 

We use techniques from empirical process theory to prove the theoretical results.
Let $v_n(\beta) = \sum_{i=1}^n \rho_\tau(y_i - x_i^T\beta)$. Then $L_n(\beta) = v_n(\beta) + n\lambda_n \sum_{j=1}^p d_j|\beta_j|$.
For a given deterministic $M > 0$, define the set

$B_0(M) = \{\beta \in R^p : \|\beta - \beta^*\|_2 \le M, \mathrm{supp}(\beta) \subseteq \mathrm{supp}(\beta^*)\}$.

Then, define the function

(A.1) $Z_n(M) = \sup_{\beta \in B_0(M)} \frac{1}{n}\big|(v_n(\beta) - v_n(\beta^*)) - E(v_n(\beta) - v_n(\beta^*))\big|$.
Fig 2: QQ-plots of residuals for different methods in the eQTL study



Lemma 1 in Section A.7 gives the rate of convergence for $Z_n(M)$.
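
For concreteness, the check loss and the penalized objective $L_n(\beta)$ defined at the start of this appendix can be written as in the sketch below. The function names `check_loss` and `objective` are illustrative, and the data are plain Python lists.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u <= 0})."""
    return u * (tau - (1.0 if u <= 0 else 0.0))


def objective(beta, X, y, d, lam, tau=0.5):
    """L_n(beta) = sum_i rho_tau(y_i - x_i^T beta) + n * lam * sum_j d_j |beta_j|."""
    n = len(y)
    resid = [yi - sum(a * b for a, b in zip(xi, beta)) for xi, yi in zip(X, y)]
    v_n = sum(check_loss(r, tau) for r in resid)
    penalty = n * lam * sum(dj * abs(bj) for dj, bj in zip(d, beta))
    return v_n + penalty
```

With `tau=0.5` the loss term is half the sum of absolute residuals, which is the L1-regression case used in the simulations.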

A.1. Proof of Theorem 1. We first show that for any $\beta = (\beta_1^T, 0^T)^T \in
B_0(M)$ with $M = o(\kappa_n^{-1} s^{-1/2})$,

(A.2) $E[v_n(\beta) - v_n(\beta^*)] \ge \frac{1}{2} c_0 c n \|\beta_1 - \beta_1^*\|_2^2$

for sufficiently large $n$, where $c$ is the lower bound for $f_i(\cdot)$ in the neighborhood
of 0. The intuition follows from the fact that $\beta^*$ is the minimizer of
the function $E v_n(\beta)$ and hence in Taylor's expansion of $E[v_n(\beta) - v_n(\beta^*)]$
around $\beta^*$, the first order derivative is zero at the point $\beta = \beta^*$. The left-hand
side of (A.2) will be controlled by $Z_n(M)$. This yields the L2-rate of
convergence in Theorem 1.

To prove (A.2), we set $a_i = |S_i^T(\beta_1 - \beta_1^*)|$. Then, for $\beta \in B_0(M)$,

$|a_i| \le \|S_i\|_2 \|\beta_1 - \beta_1^*\|_2 \le \sqrt{s}\,\kappa_n M \to 0$.
Thus if $S_i^T(\beta_1 - \beta_1^*) > 0$, by $E 1\{\varepsilon_i \le 0\} = \tau$, Fubini's theorem, mean value
theorem, and Condition 1 it is easy to derive that

(A.3)

$E[\rho_\tau(\varepsilon_i - a_i) - \rho_\tau(\varepsilon_i)] = E[a_i(1\{\varepsilon_i \le a_i\} - \tau) - \varepsilon_i 1\{0 \le \varepsilon_i \le a_i\}]$

$= E\big[\int_0^{a_i} 1\{0 \le \varepsilon_i \le s\}\,ds\big] = \int_0^{a_i} (F_i(s) - F_i(0))\,ds = \frac{1}{2} f_i(0) a_i^2 + o(1) a_i^2$,

where the $o(1)$ is uniformly over all $i = 1, \ldots, n$. When $S_i^T(\beta_1 - \beta_1^*) < 0$,
the same result can be obtained. Furthermore, by Condition 2,

n 
i=l 

This together with (A.3) and the definition of $v_n(\beta)$ proves (A.2).
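
The quadratic behaviour used in (A.3), namely that $\int_0^{a}(F(s) - F(0))\,ds$ is close to $\frac{1}{2} f(0) a^2$ for small $a$, can be checked numerically. The sketch below assumes standard normal errors purely for illustration (the argument itself only needs a density that is well behaved near 0), and the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def excess_loss_integral(a, steps=10000):
    """Midpoint-rule value of the integral of F(s) - F(0) over [0, a]."""
    h = a / steps
    return sum(normal_cdf((k + 0.5) * h) - 0.5 for k in range(steps)) * h


def quadratic_approx(a):
    """Leading term f(0) * a^2 / 2 with f the standard normal density."""
    return 0.5 * a * a / math.sqrt(2.0 * math.pi)
```

The ratio `excess_loss_integral(a) / quadratic_approx(a)` approaches 1 as `a` shrinks, which is the content of the $o(1) a_i^2$ remainder.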

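The inequality (A.2) holds for any $\beta = (\beta_1^T, 0^T)^T \in B_0(M)$, yet $\hat\beta^o =
((\hat\beta_1^o)^T, 0^T)^T$ may not be in the set. Thus, we let $\tilde\beta = (\tilde\beta_1^T, 0^T)^T$, where

$\tilde\beta_1 = u\hat\beta_1^o + (1 - u)\beta_1^*$, with $u = M/(M + \|\hat\beta_1^o - \beta_1^*\|_2)$,

which falls in the set $B_0(M)$. Then, by the convexity and the definition of
$\hat\beta^o$,

$L_n(\tilde\beta) \le u L_n(\hat\beta_1^o, 0) + (1 - u) L_n(\beta_1^*, 0) \le L_n(\beta_1^*, 0) = L_n(\beta^*)$.

Using this and the triangle inequality we have

$E[v_n(\tilde\beta) - v_n(\beta^*)] = \{v_n(\beta^*) - E v_n(\beta^*)\} - \{v_n(\tilde\beta) - E v_n(\tilde\beta)\}$
$\qquad + L_n(\tilde\beta) - L_n(\beta^*) + n\lambda_n\|d_0 \circ \beta_1^*\|_1 - n\lambda_n\|d_0 \circ \tilde\beta_1\|_1$

(A.4) $\qquad \le n Z_n(M) + n\lambda_n\|d_0 \circ (\beta_1^* - \tilde\beta_1)\|_1$.

By the Cauchy-Schwarz inequality, the very last term is bounded by
$n\lambda_n\|d_0\|_2\|\tilde\beta_1 - \beta_1^*\|_2 \le n\lambda_n\|d_0\|_2 M$.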

Define the event $\mathcal{E}_n = \{Z_n(M) \le 2Mn^{-1/2}\sqrt{s\log n}\}$. Then by Lemma 1,

(A.5) $P(\mathcal{E}_n) \ge 1 - \exp(-c_0 s(\log n)/8)$.

On the event $\mathcal{E}_n$, by (A.4), we have

$E[v_n(\tilde\beta) - v_n(\beta^*)] \le 2M\sqrt{sn(\log n)} + n\lambda_n\|d_0\|_2 M$.

Take $M = 2\sqrt{s/n} + \lambda_n\|d_0\|_2$. By Condition 2 and the assumption $\lambda_n\|d_0\|_2\sqrt{s}\kappa_n \to
0$, it is easy to check that $M = o(\kappa_n^{-1}s^{-1/2})$. Combining these two results
with (A.2), we obtain that on the event $\mathcal{E}_n$,

$\frac{1}{2} c_0 c n\|\tilde\beta_1 - \beta_1^*\|_2^2 \le (2\sqrt{sn(\log n)} + n\lambda_n\|d_0\|_2)(2\sqrt{s/n} + \lambda_n\|d_0\|_2)$,

which entails that

$\|\tilde\beta_1 - \beta_1^*\|_2 \le O(\lambda_n\|d_0\|_2 + \sqrt{s(\log n)/n})$.

Note that $\|\tilde\beta_1 - \beta_1^*\|_2 \le M$ implies $\|\hat\beta_1^o - \beta_1^*\|_2 \le 2M$. Thus, on the event $\mathcal{E}_n$,

$\|\hat\beta_1^o - \beta_1^*\|_2 \le O(\lambda_n\|d_0\|_2 + \sqrt{s(\log n)/n})$.

The second result follows trivially.

A.2. Proof of Theorem 2. Since $\hat\beta_1^o$ defined in Theorem 1 is a minimizer
of $L_n(\beta_1, 0)$, it satisfies the KKT conditions. To prove that $\hat\beta =
((\hat\beta_1^o)^T, 0^T)^T \in R^p$ is a global minimizer of $L_n(\beta)$ in the original $R^p$ space,
we only need to check the following condition

(A.6) $\|d_1^{-1} \circ Q^T\rho'_\tau(y - S\hat\beta_1^o)\|_\infty < n\lambda_n$,

where $\rho'_\tau(u) = (\rho'_\tau(u_1), \ldots, \rho'_\tau(u_n))^T$ for any $n$-vector $u = (u_1, \ldots, u_n)^T$
with $\rho'_\tau(u_i) = \tau - 1\{u_i \le 0\}$. Here, $d_1^{-1}$ denotes the vector $(d_{s+1}^{-1}, \ldots, d_p^{-1})^T$.
Then the KKT conditions and the convexity of $L_n(\beta)$ together ensure that
$\hat\beta$ is a global minimizer of $L_n(\beta)$.

Define events

$A_1 = \{\|\hat\beta_1^o - \beta_1^*\|_2 \le \gamma_n\}$, $A_2 = \{\sup_{\beta \in \mathcal{N}} \|d_1^{-1} \circ Q^T\rho'_\tau(y - S\beta_1)\|_\infty < n\lambda_n\}$,

where $\gamma_n$ is defined in Theorem 1 and

$\mathcal{N} = \{\beta = (\beta_1^T, \beta_2^T)^T \in R^p : \|\beta_1 - \beta_1^*\|_2 \le \gamma_n, \beta_2 = 0 \in R^{p-s}\}$.

Then by Theorem 1 and Lemma 2 in Section A.7, $P(A_1 \cap A_2) \ge 1 - o(n^{-cs})$.
Since $\hat\beta \in \mathcal{N}$ on the event $A_1$, the inequality (A.6) holds on the event $A_1 \cap A_2$.
This completes the proof of Theorem 2.
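
Condition (A.6) is easy to evaluate for a fitted model, which can serve as a numerical sanity check of a candidate solution. The sketch below is illustrative only: `Q` is assumed to be stored as a list of rows over the noise covariates, `resid` holds the residuals $y - S\hat\beta_1^o$, `d_noise` holds the weights $d_{s+1}, \ldots, d_p$, and the function name `kkt_holds` is hypothetical.

```python
def kkt_holds(Q, resid, d_noise, lam, tau=0.5):
    """Check max_j |Q_j^T rho'_tau(resid)| / d_j < n * lam over the noise columns.

    Returns (condition_holds, largest_weighted_score).
    """
    n = len(resid)
    # subgradient-type score rho'_tau(u) = tau - 1{u <= 0}
    score = [tau - (1.0 if r <= 0 else 0.0) for r in resid]
    worst = 0.0
    for j, dj in enumerate(d_noise):
        g = sum(Q[i][j] * score[i] for i in range(n))
        worst = max(worst, abs(g) / dj)
    return worst < n * lam, worst
```

A larger weight `d_j` makes the condition easier to satisfy for that covariate, which is how the adaptive weights help keep noise variables out of the model.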
A.3. Proof of Theorem 3. This proof is motivated by Pollard (1990).
In the proof we use $C > 0$ to denote a generic positive constant.

Let $\theta = V_n^{-1}(\beta_1 - \beta_1^*)$, then $\beta_1 = \beta_1^* + V_n\theta$. Letting $G_n(\theta) = L_n(\beta_1, 0) -
L_n(\beta_1^*, 0)$, we have that

(A.7) $G_n(\theta) = \|\rho_\tau(\varepsilon - Z_n\theta)\|_1 - \|\rho_\tau(\varepsilon)\|_1 + n\lambda_n(\|d_0 \circ (\beta_1^* + V_n\theta)\|_1 - \|d_0 \circ \beta_1^*\|_1)$,

where we have used the shorthand notation that $\|\rho_\tau(u)\|_1 = \sum_{i=1}^n \rho_\tau(u_i)$
for any vector $u = (u_1, \ldots, u_n)^T$. Since $L_n(\beta_1, 0)$ is minimized at $\beta_1 = \hat\beta_1^o$,
it follows that $G_n(\theta)$ is minimized at $\hat\theta_n = V_n^{-1}(\hat\beta_1^o - \beta_1^*)$. We consider $\theta$
over the convex open set

$B_0(n) = \{\theta \in R^s : \|\theta\|_2 < c_6\sqrt{s}\}$,

with some constant $c_6 > 0$ independent of $s$.

The idea of the proof is to approximate the stochastic function $G_n(\theta)$ by
a quadratic function, whose minimizer is shown to possess the asymptotic
normality. Since $G_n(\theta)$ and the quadratic approximation are close, the minimizer
of $G_n(\theta)$ enjoys the same asymptotic normality. Now, we proceed to
prove Theorem 3.

Decompose $G_n(\theta)$ into its mean and centralized stochastic component:

(A.8) $G_n(\theta) = Q_n(\theta) + T_n(\theta)$,

where $Q_n(\theta) = E[G_n(\theta)]$ and

(A.9) $T_n(\theta) = \|\rho_\tau(\varepsilon - Z_n\theta)\|_1 - \|\rho_\tau(\varepsilon)\|_1 - E[\|\rho_\tau(\varepsilon - Z_n\theta)\|_1 - \|\rho_\tau(\varepsilon)\|_1]$.

We first deal with the mean component $Q_n(\theta)$. Since $\|\theta\|_2 < c_6\sqrt{s}$ over the
set $B_0(n)$, it follows from the Cauchy-Schwarz inequality and the assumption
of the theorem that

$\|H^{1/2}Z_n\theta\|_\infty \le \|\theta\|_2 \max_i \|H^{1/2}Z_{ni}\|_2 = o(s^{-3}(\log s)^{-1})$.

Then, since f"(u) is bounded in a small neighborhood of 0, by using a similar 
argument as (A.3) and noting that T,i=i fi(°)\ z L e \ 2 = T (Z^HZ„)0 = 
|| 1| 2 we can show that 

(A.10) £[11^(6 -Z B 0)||i-||Pr(s)lll] 

1 n n 

= H0III + ^E^(°)i z ^i 3 + «(E^°)i z ^i 3 )- 
Furthermore, since $\sum_{i=1}^n f_i(0)|Z_{ni}^T\theta|^2 = \|\theta\|_2^2 \le c_6^2 s$, it follows that

(A.11) $\sum_{i=1}^n f_i(0)|Z_{ni}^T\theta|^3 \le C\|H^{1/2}Z_n\theta\|_\infty \sum_{i=1}^n f_i(0)|Z_{ni}^T\theta|^2 = o(s^{-2}(\log s)^{-1})$.

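Next, we deal with the penalty term in the expected value $Q_n(\theta)$. Since
$n^{-1}S^T H S$ has bounded eigenvalues by Condition 2, it follows from the assumption
of the theorem that, for any $\theta \in B_0(n)$,

$\|V_n\theta\|_\infty \le \|V_n\theta\|_2 \le Cn^{-1/2}\|\theta\|_2 = o(\min_{1 \le j \le s} |\beta_j^*|)$.

Hence, $\mathrm{sgn}(\beta_1^* + V_n\theta) = \mathrm{sgn}(\beta_1^*)$ and

(A.12) $\|d_0 \circ (\beta_1^* + V_n\theta)\|_1 - \|d_0 \circ \beta_1^*\|_1 = \tilde{d}_0^T V_n\theta$,

where $\tilde{d}_0$ is an $s$-vector with $j$-th component $d_j\,\mathrm{sgn}(\beta_j^*)$. Combining (A.10)-(A.12)
yields

(A.13) $Q_n(\theta) = \|\theta\|_2^2 + n\lambda_n\tilde{d}_0^T V_n\theta + o(1)$,

where $o(\cdot)$ is uniformly over all $\theta \in B_0(n)$.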

We now deal with the stochastic part $T_n(\theta)$. Define $D = -(\rho'_\tau(\varepsilon_1), \ldots, \rho'_\tau(\varepsilon_n))^T$,
$W_n = Z_n^T D$, and

$R_n(\theta) = \|\rho_\tau(\varepsilon - Z_n\theta)\|_1 - \|\rho_\tau(\varepsilon)\|_1 - W_n^T\theta$.

Then $E[W_n^T\theta] = 0$ and

(A.14) $T_n(\theta) = W_n^T\theta + r_n(\theta)$,

where $r_n(\theta) = R_n(\theta) - E[R_n(\theta)]$. Here, $W_n^T\theta$ can be regarded as the first
order approximation of $\|\rho_\tau(\varepsilon - Z_n\theta)\|_1 - \|\rho_\tau(\varepsilon)\|_1$. We next show $r_n(\theta)$ is
uniformly small. By Lemma 3, there exists a sequence $b_n \to \infty$ such that
for any $\epsilon > 0$,

(A.15) $P(|r_n(\theta)| \ge \epsilon) \le \exp(-C\epsilon b_n s(\log s))$.

Define $\lambda_n(\theta) = G_n(\theta) - n\lambda_n\tilde{d}_0^T V_n\theta - W_n^T\theta$ and $\lambda(\theta) = \|\theta\|_2^2$. Then by
definition (A.7), $\lambda_n(\theta)$ and $\lambda(\theta)$ are both convex functions on the set $B_0(n)$.
Furthermore, by definition, we can write $r_n(\theta)$ as

$r_n(\theta) = \lambda_n(\theta) - \lambda(\theta) - o(1)$.

Since $\|\theta\|_2 < c_6\sqrt{s}$ for all $\theta \in B_0(n)$, by Condition 1 we have that for any
$\theta_1, \theta_2 \in B_0(n)$,

$|\lambda(\theta_1) - \lambda(\theta_2)| = |(\theta_1 + \theta_2)^T(\theta_1 - \theta_2)|$
$\qquad \le \|\theta_1 + \theta_2\|_2\|\theta_1 - \theta_2\|_2 \le C\sqrt{s}\|\theta_1 - \theta_2\|_2$.

Thus, the above result and (A.15) indicate that conditions in Lemma 4 are
satisfied. Then, for any compact set $K_s = \{\|\theta\|_2 \le c_4\sqrt{s}\} \subset B_0(n)$ with
$0 < c_4 < c_6$ some constant,

(A.16) $\sup_{\theta \in K_s} |r_n(\theta)| = o_p(1)$.

Combining (A. 8), (A. 13) and (A. 14) we can write that 

(A.17) G n (0) = ||0||!+nA n doV n + W£0 + r n (0) + o(l) 
(A.18) = ||0-t7 n ||i-h n ||I + r„(0) + o(l), 

where 

Vn = ~ 2 («A n V n d + W n ) . 
By a classic weak convergence result, it is easy to see that 

c r (Z£Z n )-V2w n AiV(0,T(l-T)), 
for any c G R s satisfying c T c = 1. It follows immediately that 

(A.19) c^Z^X)" 1 / 2 (r? n + inA n V n d ) A iV(0, r(l - r)) . 

We only need to show that the minimizer of G n (6) is close to ?7 n , i.e., for 

any e > 0, 

(A.20) P(\\0-Vnh>t) -+0. 

Hence Theorem 3 will follow from (A.19) and Slutsky's lemma. 

We now proceed to prove (A.20). First, let B\(n) be a ball with center 
r\ n and radius e. Since c T W n has asymptotic normal distribution iV(0, 1) 
for any c G R s with c T c = 1, and nV n V^ = (^S r HS) -1 has bounded 
eigenvalues, by definition, we can bound rj n as 

\\Vnh < ^(HWjs + nAjV^dolh) 

(A.21) < ^{O p (V~s) + CA nV ^||d || 2 ) = ^(1 + O p (l)), 
where the last step is by the assumption $\lambda_n\sqrt{n}\|d_0\|_2 = O(\sqrt{s})$ of the theorem.
Since $c_6$ in the definition of $B_0(n)$ can be chosen to be much larger
than $C/2$, it follows that for each fixed $s$, the compact set $K_s = \{\|\theta\|_2 \le
c_4\sqrt{s}\} \subset B_0(n)$ with $c_4$ large enough can cover the ball $B_1(n)$ with probability
arbitrarily close to 1. Therefore, by (A.16)

(A.22) A n = sup |r n (0)| < sup \r n {6)\ = o p (l). 

0e-Bi(n) Oei<s 

Now, we are ready to prove (A. 20). Consider the behavior of G n (0) outside 
of the ball B\(n). Let = T] n + ku 6 R s be a vector outside the ball B\(n), 
where u E R s is a unit vector and k is a constant satisfying k > e, with e 
the radius of B\(n). Define 0* as the boundary point of B\(n) that lies on 
the line segment connecting r\ n and 6. Then we can write 6* = r\ n + eu = 
(1 - e/n)rj n + eO/K. By the convexity of G n , (A.18) and (A.22), 

-G n {6) + (1 - -)G n (r, n ) > G n (9*) > e 2 - \\rjj 2 2 - A n > e 2 + G n (rj n ) - 2A 

Since e < k, it follows that for large enough n, 

(A.23) tof{||9-„ n ||> e } > G n (rj n ) + -Je 2 - o p (l)} > G n ( Vn ). 

This establishes (A. 20) and proves Theorem 3. 

A. 4. Proof of Theorem 4. The idea of the proof follows those used 
in the proof of Theorems 1 and 2. We first consider the minimizer of L n ((3) 
in the subspace {f3 = (0[,0l) T £ W : j3 2 = 0}. Let (3 = (/3f ,0) T , wher e 
fix = /3i+a n vi e R s witha n = ^(log n)/n+ A n (||do||2+C2C 5x /s(logp)/n), 
Il v i||2 = C, and C > is some large enough constant. By the assumptions 
in the theorem we have a n = o(k~ 1 s -1 / 2 ). Note that 

(A.24) L n ((3l + a n vi, 0) - L n {(3\, 0) = ii(vi) + J 2 (vi), 

where Ji(vi) = ||p T (y - S(/3^+ a n vi))||i - ||p T (y - S/3*)||i and J 2 (vi) = 
nA n (||d o (/3i + 5 n vi)||i - ||d o /3^||i) with ||p r (u)||i = ZX=lM u i) for 
any vector u = (ux,--- ,u n ) T . By the results in the proof of Theorem 1, 
E[Ii(yi)] > 2 _1 con||a n vi||2, and moreover, with probability at least 1— n~ cs , 

|ii(vi) - ^[/i(vi)]| < nZ n (Ca n ) < 2a n v / s(logn)n||vi|| 2 . 
Thus, by the triangle inequality, 

(A.25) h(vi) > 2 _1 coa 2 n||vi||2 - 2a n v / s(logn)?i||vi|| 2 . 
The second term on the right side of (A.24) can be bounded as
(A.26) |^(vi)| < nA n ||d o (5 n Vi)||i < na^.A^ || d 1| 2 II vi lb • 

By triangle inequality and Conditions 4 and 5, it holds that 
(A.27) 

||d || 2 < ||do-d*|| 2 +||d*|| 2 < C5||3 i r-/3^|| 2 +||dS||2 < C 2 c 5 y / s(logp)/n+\\d* \\ 2 . 

Thus, combining (A.24)-(A.27) yields 

L n ((3* + 5„vi) - L n (J3*) >2 _1 c na2 ||vi||^ - 2a nV / s(logn)n||vi|| 2 

- na n ,A n ,(||do|| 2 + C 2 c 5 y/ s (log p)/n) ||vi|| 2 . 

Making ||vi|| 2 = C large enough, we obtain that with probability tending 
to one, L n (/3* + a n v) — L n (f3*) > 0. Then, it follows immediately that with 
asymptotic probability one, there exists a minimizer (3 l of L n (/3±,0) such 
that ||/3 1 — /3*|| 2 < Csd n = a n with some constant C3 > 0. 
It remains to prove that with asymptotic probability one, 

(A.28) Hdi 1 o QVr(y - s3i)Hoc < n\ n . 

Then by KKT conditions, (3 = ((3 l , T ) T is a global minimizer of L n (f3). 

Now we proceed to prove (A.28). Since /?* = for all j = s + 1, ■ ■ ■ ,p, 
we have that cf- = p' Xn (0+). Furthermore, by Condition 4, it holds that 
|/3j m | < C 2 y/ s(logp)/n with asymptotic probability one. Then, it follows 
that 

mjnp^d^l) >p' Xn (C 2 y/s(logp)/n). 
Therefore, by Condition 5 we conclude that 

(A.29) IKd^^lU = (min^JI^I))- 1 < 2/^(0+) = 2|| (d*,)" 1 |U. 

J>S 

From the conditions of Theorem 2 with ^ n = a n , it follows from Lemma 
2 (inequality (A. 41)) that, with probability at least 1 — o(p~ c ), 

(A-30) sup ||Q T p;(y " S/30IU < 9N , H ^-ill (! + 

ll/3i-/3tl|2<C 3 a„ z \\K a l) Woo 

Combining (A.29)-(A.30) and by the triangle inequality, it holds that with 
asymptotic probability one, 

sup || (Si)" 1 o QVr(y - S/3i)||oc < n\ n . 

Pi-/31l|2<C 3 o„ 

Since the minimizer fli satisfies \\f3i — /3i|| 2 < C^,a n with asymptotic prob- 
ability one, the above inequality ensures that (A.28) holds with probability 
tending to one. This completes the proof. 






A. 5. Proof of Theorem 5. The proof of Theorem 5 follows from those 
of Theorems 3 and 4. We use C to denote a generic constant in the proof. By 
Theorem 4, with asymptotic probability one, there exists a global minimizer 

3 = (3i ,0 T ) T of L n (J3) and ||3i - PXh < «n- 

Next we study the asymptotic normality of /3 1 . Following (A. 7) in the 
proof of Theorem 3, define 

G n (0) = L n (V n G + (31,0)- L n (ft,0), 

where V n and are the same as in the proof of Theorem 3. Then n — 
V~ 1 (/3 1 — (3\) is a global minimizer of G n {6). The idea of the proof is to show 
that G n {9) in (A. 7) and G n {6) are uniformly close to each other. Since G n {6) 
can be well approximated by a sequence of quadratic functions, G n {6) can 
be well approximated by the same sequence of quadratic functions. Thus, 
the minimizer of G n {9) enjoys the same asymptotic properties as that of 
G n {6). 

We now proceed to prove that G n (9) and G n (6) are uniformly close. To 
this end, first note that for any (3 1 with 11/3x112 < Cyfs, 

(A.31) 

\L n {p Xl Q) - I„G9i,0)| < nXnWPMdl - do|| 2 < Cn\ n ^\\d* - d \\ 2 . 
For 1 < j < s, by the mean-value theorem, 

(a.32) \d* -dj\ = \ P ' Xn (\p*\) -p' An (i^fi)i = \ P u%m-p?% 

where f3j lies on the segment connecting (3* and fij 11 . By Condition 4 and 
the triangle inequality, with asymptotic probability one 

%\ > - \Pj ~ P*\ > |/3*| - C 2 y/s(\ogp)/n > 2^ mm\(3*\. 

This together with Condition 6 ensures that p'\ n {\[3j\) = o p (s^ 1 X^ 1 (nlogp)^ 1 / 2 ) . 
Thus, in view of (A.32), 

||d - d$|| 2 < o^X-^nlogp)- 1 / 2 )^ - 3f || 2 < o p { S - l ' 2 X- l n~ l ). 

Since ||0||2 < Cy/s ensures ||/3i||2 < Cy/s, the above inequality combined 
with (A.31) entails that 

sup \G n (0) - G n (0)\ = sup \L n (J3 1 ,0)-L n (p 1 ,0)\=o p (l), 
eeB (n) \\P-Lh<Cy/I 
where $B_0(n)$ is defined in the proof of Theorem 3. Therefore, by the above result and
(A.23), for any $\theta \in B_0(n)$ and $\|\theta - \eta_n\|_2 \ge \epsilon$ with $\epsilon > 0$ arbitrarily small,

inf G n (0)> inf G n (0)- sup \G n (0) - G n {9)\ 

{\\Q-V n h>t} ||0-TjJ| 2 >e 0&B (n) 

> G n ( Vn ) + ~y - 0p (i)] - 0p (i) 

>G n ( Vn ) + ^[e 2 -o p (l)]-o p (l). 

Then, it follows immediately that the minimizer ||0 n — VnWz — 6 with asymp- 
totic probability one. Thus 9 n — r) n = o p (l). The proof of Theorem 5 is 
completed. 

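A.6. Proof of Proposition 1. Since s is finite, the summation of the probabilities below is of the same order as the maximum of the probabilities below. The distribution result (3.1) entails that

$$P(\|n^{-1}S^T\varepsilon\|_\infty > z) = P\Big(\max_{1\le j\le s} |\tilde x_j^T\varepsilon| > nz\Big) \sim \sum_{j=1}^{s} P(|\tilde x_j^T\varepsilon| > Cnz) \sim \sum_{j=1}^{s} \|\tilde x_j\|_\alpha^\alpha\, c_\alpha (Cnz)^{-\alpha},$$

where $C > 0$ is some generic constant. For any sequence $b_n$ such that $b_n \to \infty$, by letting $z = n^{-1}\tilde b_n b_n$ with $\tilde b_n = (\sum_{j=1}^{s} \|\tilde x_j\|_\alpha^\alpha)^{1/\alpha}$, we have that

$$P(\|n^{-1}S^T\varepsilon\|_\infty > n^{-1}\tilde b_n b_n) \to 0.$$

In other words, $\|n^{-1}S^T\varepsilon\|_\infty = O_p(n^{-1}\tilde b_n)$. Hence, for (3.4) to have a solution $\hat{\boldsymbol\beta}_1 = (\hat\beta_1, \dots, \hat\beta_s)^T$ with $\hat\beta_j > 0$ for all $j = 1, \dots, s$, the necessary conditions are $\beta_0 n \tilde b_n^{-1} \to \infty$ and $\lambda_n < \beta_0$. Combining these conditions we have

$$\lambda_n < \beta_0 = n^{-1}\tilde b_n b_n, \tag{A.33}$$

with some diverging sequence $b_n$.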

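We next check condition (3.5). Combining (3.1) and (A.33) ensures that for $j > s$,

$$P(|\tilde x_j^T\varepsilon| > n\lambda_n) \ge P(|\tilde x_j^T\varepsilon| > \tilde b_n b_n) \sim \|\tilde x_j\|_\alpha^\alpha\, c_\alpha (\tilde b_n b_n)^{-\alpha}.$$

Since we assumed $\mathrm{supp}(\tilde x_i) \cap \mathrm{supp}(\tilde x_j) = \emptyset$ for any distinct $i, j \in \{s+1, \dots, p\}$, it follows that $Q^T\varepsilon$ is a vector of independent random variables with components $\tilde x_j^T\varepsilon$. Then,

$$P(\|Q^T\varepsilon\|_\infty > n\lambda_n) = 1 - P(\|Q^T\varepsilon\|_\infty \le n\lambda_n) = 1 - \prod_{j>s}\big(1 - P(|\tilde x_j^T\varepsilon| > n\lambda_n)\big)$$

$$\ge 1 - \prod_{j>s}\big(1 - C\|\tilde x_j\|_\alpha^\alpha (\tilde b_n b_n)^{-\alpha}\big) = 1 - \exp\Big(\sum_{j>s}\log\big(1 - C\|\tilde x_j\|_\alpha^\alpha (\tilde b_n b_n)^{-\alpha}\big)\Big).$$

Since $\log(1 + x) \le x$ for all $x > -1$, we have that

$$\sum_{j>s}\log\big(1 - C\|\tilde x_j\|_\alpha^\alpha (\tilde b_n b_n)^{-\alpha}\big) \le -C\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha (\tilde b_n b_n)^{-\alpha}.$$

Combining the above two inequalities, if $\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha (\tilde b_n b_n)^{-\alpha} \to c_0/C \in (0, \infty]$, then

$$P(\|Q^T\varepsilon\|_\infty > n\lambda_n) \ge 1 - e^{-c_0}.$$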
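The step from the product to the exponential can be sanity-checked numerically. The sketch below is an illustration only, not part of the original proof: for independent events with probabilities $p_j$, the probability that at least one occurs is $1 - \prod_j (1 - p_j)$, and $\log(1+x) \le x$ gives the lower bound $1 - \exp(-\sum_j p_j)$. The function names and the probabilities are hypothetical.

```python
import math

def prob_any(probs):
    """Exact P(at least one of several independent events occurs)."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - p
    return 1.0 - prod

def exp_lower_bound(probs):
    """Lower bound 1 - exp(-sum p_j), which follows from log(1 + x) <= x."""
    return 1.0 - math.exp(-sum(probs))

# Hypothetical tail probabilities, one per noise covariate j > s.
probs = [0.01, 0.02, 0.005, 0.03]
exact = prob_any(probs)
bound = exp_lower_bound(probs)
assert bound <= exact  # the bound never exceeds the exact probability
```

The bound is tight when every $p_j$ is small, which is the regime of the argument above.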

That is, with probability at least $1 - e^{-c_0}$, (3.5) fails to hold and Lasso in (3.3) does not have the model selection oracle property. In fact, if $b_n \le O(\tilde b_n^{-1}(\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha)^{1/\alpha})$, or equivalently by (A.33),

$$n\beta_0 \le O\Big(\Big(\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha\Big)^{1/\alpha}\Big),$$

then $c_0/C \in (0, \infty]$. In other words, unless we have

$$n\beta_0\Big(\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha\Big)^{-1/\alpha} \to \infty, \tag{A.34}$$

Lasso is not able to recover the true model and the correct sign.

Finally, since $\|\tilde x_j\|_2 = \sqrt{n}$, $|\mathrm{supp}(\tilde x_j)| = O(n^{1/2})$ and $\max_{ij}|x_{ij}| = O(n^{1/4})$, the nonzero components of $Q$ are all of the same order $O(n^{1/4})$. Consequently, the $\|\tilde x_j\|_\alpha^\alpha$ must all be of the same order $O(n^{(2+\alpha)/4})$. Then, $n\{\sum_{j>s}\|\tilde x_j\|_\alpha^\alpha\}^{-1/\alpha} = O(n^{3/4 - 1/\alpha})$ and the condition (A.34) above becomes $n^{3/4 - 1/\alpha}\beta_0 \to \infty$.

A.7. Lemmas.



Lemma 1. Under Condition 2, for any $t > 0$, we have

$$P\big(Z_n(M) \ge 4M\sqrt{s/n} + t\big) \le \exp\big(-nc_0t^2/(8M^2)\big). \tag{A.35}$$

Proof. Define $\rho(s, y) = (y - s)(\tau - 1\{y - s \le 0\})$. Then, $v_n(\beta)$ in (A.1) can be rewritten as $v_n(\beta) = \sum_{i=1}^{n}\rho(x_i^T\beta, y_i)$. Note that the following Lipschitz condition holds for $\rho(\cdot, y_i)$:

$$|\rho(s_1, y_i) - \rho(s_2, y_i)| \le \max\{\tau, 1 - \tau\}\,|s_1 - s_2| \le |s_1 - s_2|. \tag{A.36}$$
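As an informal check of the Lipschitz property of the check loss (a minimal sketch, not part of the proof), the snippet below evaluates $\rho(s, y) = (y - s)(\tau - 1\{y - s \le 0\})$ at random points and records the largest observed ratio $|\rho(s_1, y) - \rho(s_2, y)| / |s_1 - s_2|$. The function names, the sampling range, and the value of $\tau$ are illustrative assumptions.

```python
import random

def check_loss(s, y, tau):
    """rho(s, y) = (y - s) * (tau - 1{y - s <= 0}), the quantile check loss."""
    u = y - s
    return u * (tau - (1.0 if u <= 0 else 0.0))

def max_lipschitz_ratio(tau, trials=10000, seed=0):
    """Largest observed |rho(s1, y) - rho(s2, y)| / |s1 - s2| over random draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        y = rng.uniform(-5, 5)
        s1 = rng.uniform(-5, 5)
        s2 = rng.uniform(-5, 5)
        # Skip nearly equal points to keep floating-point error negligible.
        if abs(s1 - s2) > 1e-3:
            diff = abs(check_loss(s1, y, tau) - check_loss(s2, y, tau))
            worst = max(worst, diff / abs(s1 - s2))
    return worst

tau = 0.3  # hypothetical quantile level
worst = max_lipschitz_ratio(tau)
# The ratio never exceeds max(tau, 1 - tau), hence never exceeds 1.
assert worst <= max(tau, 1 - tau) + 1e-9
```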



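Let $W_1, \dots, W_n$ be a Rademacher sequence, independent of model errors $\varepsilon_1, \dots, \varepsilon_n$. The Lipschitz inequality (A.36) combined with the symmetrization theorem and the contraction inequality (see, for example, Theorems 14.3 and 14.4 in Bühlmann and van de Geer (2011)) yields that

$$E[Z_n(M)] \le 2E\sup_{\beta\in B_0(M)}\Big|\frac{1}{n}\sum_{i=1}^{n}W_i\big(\rho(x_i^T\beta, y_i) - \rho(x_i^T\beta^*, y_i)\big)\Big| \le 4E\sup_{\beta\in B_0(M)}\Big|\frac{1}{n}\sum_{i=1}^{n}W_i\big(x_i^T\beta - x_i^T\beta^*\big)\Big|. \tag{A.37}$$

On the other hand, by the Cauchy-Schwarz inequality

$$\Big|\sum_{i=1}^{n}W_i\big(x_i^T\beta - x_i^T\beta^*\big)\Big| = \Big|\sum_{j=1}^{s}\Big(\sum_{i=1}^{n}W_ix_{ij}\Big)(\beta_j - \beta_j^*)\Big| \le \|\beta_1 - \beta_1^*\|_2\Big\{\sum_{j=1}^{s}\Big|\sum_{i=1}^{n}W_ix_{ij}\Big|^2\Big\}^{1/2}.$$

By Jensen's inequality and concavity of the square root function, $E(X^{1/2}) \le (EX)^{1/2}$ for any non-negative random variable $X$. Thus, these two inequalities ensure that the very right hand side of (A.37) can be further bounded by

$$4\sup_{\beta\in B_0(M)}\frac{1}{n}\|\beta_1 - \beta_1^*\|_2\Big\{\sum_{j=1}^{s}E\Big|\sum_{i=1}^{n}W_ix_{ij}\Big|^2\Big\}^{1/2} \le \frac{4M}{n}\Big\{\sum_{j=1}^{s}\sum_{i=1}^{n}x_{ij}^2\Big\}^{1/2} = 4M\sqrt{s/n}. \tag{A.38}$$

Therefore, it follows from (A.37) and (A.38) that

$$E[Z_n(M)] \le 4M\sqrt{s/n}. \tag{A.39}$$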
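The Jensen step used for a single column, $E|\sum_i W_ix_{ij}| \le (E|\sum_i W_ix_{ij}|^2)^{1/2} = (\sum_i x_{ij}^2)^{1/2}$, can be checked exactly for small $n$ by enumerating all sign patterns of the Rademacher sequence. This is an illustrative sketch; the vector `x` is a hypothetical column of the design matrix.

```python
import itertools
import math

def mean_abs_rademacher_sum(x):
    """Exact E|sum_i W_i x_i| over all 2^n equally likely sign vectors W."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=len(x)):
        total += abs(sum(w * xi for w, xi in zip(signs, x)))
    return total / 2 ** len(x)

x = [1.0, -2.0, 0.5, 3.0]  # hypothetical design column
lhs = mean_abs_rademacher_sum(x)
rhs = math.sqrt(sum(xi * xi for xi in x))  # (E|sum W_i x_i|^2)^{1/2}
assert lhs <= rhs
```

The enumeration costs $2^n$ evaluations, so it is only meant for very small $n$.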



Next, since $n^{-1}S^TS$ has bounded eigenvalues, for any $\beta = (\beta_1^T, 0^T)^T \in B_0(M)$,

$$\frac{1}{n}\sum_{i=1}^{n}\big(x_i^T(\beta - \beta^*)\big)^2 = \frac{1}{n}(\beta_1 - \beta_1^*)^TS^TS(\beta_1 - \beta_1^*) \le c_0^{-1}\|\beta_1 - \beta_1^*\|_2^2 \le c_0^{-1}M^2.$$

Combining this with the Lipschitz inequality (A.36), (A.39), and applying Massart's concentration theorem (see Theorem 14.2 in Bühlmann and van de Geer (2011)) yields that for any $t > 0$,

$$P\big(Z_n(M) \ge 4M\sqrt{s/n} + t\big) \le \exp\big(-nc_0t^2/(8M^2)\big).$$

This proves the Lemma. □



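Lemma 2. Consider a ball in $R^s$ around $\beta_1^*$: $\mathcal{N} = \{\beta = (\beta_1^T, \beta_2^T)^T \in R^p : \beta_2 = 0, \|\beta_1 - \beta_1^*\|_2 \le \gamma_n\}$ with some sequence $\gamma_n \to 0$. Assume that $\min_{j>s} d_j > c_3$, $\sqrt{1 + \gamma_ns^{3/2}\kappa_n^2}\,\log_2 n = o(\sqrt{n}\lambda_n)$, $n^{1/2}\lambda_n(\log p)^{-1/2} \to \infty$, and $\kappa_n\gamma_n^2 = o(\lambda_n)$. Then under Conditions 1-3, there exists some constant $c > 0$ such that

$$P\Big(\sup_{\beta\in\mathcal{N}}\|d_1^{-1}\circ Q^T\rho'_\tau(y - S\beta_1)\|_\infty \ge n\lambda_n\Big) \le o(p^{-c}),$$

where $\rho'_\tau(u) = \tau - 1\{u \le 0\}$.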

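Proof. For a fixed $j \in \{s+1, \dots, p\}$ and $\beta = (\beta_1^T, \beta_2^T)^T \in \mathcal{N}$, define

$$\gamma_{\beta,j}(x_i, y_i) = x_{ij}\big[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i) - E[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)]\big],$$

where $x_i^T = (x_{i1}, \dots, x_{ip})$ is the i-th row of the design matrix. The key for the proof is to use the following decomposition

$$\sup_{\beta\in\mathcal{N}}\Big\|\frac{1}{n}Q^T\rho'_\tau(y - S\beta_1)\Big\|_\infty \le \sup_{\beta\in\mathcal{N}}\Big\|\frac{1}{n}Q^TE[\rho'_\tau(y - S\beta_1) - \rho'_\tau(\varepsilon)]\Big\|_\infty + \Big\|\frac{1}{n}Q^T\rho'_\tau(\varepsilon)\Big\|_\infty + \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac{1}{n}\sum_{i=1}^{n}\gamma_{\beta,j}(x_i, y_i)\Big|. \tag{A.40}$$

We will prove that with probability at least $1 - o(p^{-c})$,

$$I_1 \equiv \sup_{\beta\in\mathcal{N}}\Big\|\frac{1}{n}Q^TE[\rho'_\tau(y - S\beta_1) - \rho'_\tau(\varepsilon)]\Big\|_\infty \le \frac{\lambda_n}{2\|d_1^{-1}\|_\infty} + o(\lambda_n), \tag{A.41}$$

$$I_2 \equiv n^{-1}\|Q^T\rho'_\tau(\varepsilon)\|_\infty = o(\lambda_n), \tag{A.42}$$

$$I_3 \equiv \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac{1}{n}\sum_{i=1}^{n}\gamma_{\beta,j}(x_i, y_i)\Big| = o_p(\lambda_n). \tag{A.43}$$

Combining (A.40)-(A.43) with the assumption $\min_{j>s} d_j > c_3$ completes the proof of the Lemma.

Now we proceed to prove (A.41). Note that $I_1$ can be rewritten as

$$I_1 = \max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac{1}{n}\sum_{i=1}^{n}x_{ij}E[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)]\Big|. \tag{A.44}$$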

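By Condition 1,

$$E[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)] = F_i(S_i^T(\beta_1 - \beta_1^*)) - F_i(0) = f_i(0)S_i^T(\beta_1 - \beta_1^*) + I_i,$$

where $F_i(t)$ is the cumulative distribution function of $\varepsilon_i$, and $I_i = F_i(S_i^T(\beta_1 - \beta_1^*)) - F_i(0) - f_i(0)S_i^T(\beta_1 - \beta_1^*)$. Thus, for any $j > s$,

$$\sum_{i=1}^{n}x_{ij}E[\rho'_\tau(\varepsilon_i) - \rho'_\tau(y_i - x_i^T\beta)] = \sum_{i=1}^{n}f_i(0)x_{ij}S_i^T(\beta_1 - \beta_1^*) + \sum_{i=1}^{n}x_{ij}I_i.$$

This together with (A.44) and the Cauchy-Schwarz inequality entails that

$$I_1 \le \Big\|\frac{1}{n}Q^THS(\beta_1 - \beta_1^*)\Big\|_\infty + \max_{j>s}\Big|\frac{1}{n}\sum_{i=1}^{n}x_{ij}I_i\Big|, \tag{A.45}$$

where $H = \mathrm{diag}\{f_1(0), \dots, f_n(0)\}$. We consider the two terms on the right hand side of (A.45) one by one. By Condition 3, the first term can be bounded as

$$\Big\|\frac{1}{n}Q^THS(\beta_1 - \beta_1^*)\Big\|_\infty \le \Big\|\frac{1}{n}Q^THS\Big\|_{2,\infty}\|\beta_1 - \beta_1^*\|_2 < \frac{\lambda_n}{2\|d_1^{-1}\|_\infty}. \tag{A.46}$$

By Condition 1, $|I_i| \le c(S_i^T(\beta_1 - \beta_1^*))^2$. This together with Condition 2 ensures that the second term of (A.45) can be bounded as

$$\max_{j>s}\Big|\frac{1}{n}\sum_{i=1}^{n}x_{ij}I_i\Big| \le \frac{\kappa_n}{n}\sum_{i=1}^{n}|I_i| \le \frac{C\kappa_n}{n}\sum_{i=1}^{n}(S_i^T(\beta_1 - \beta_1^*))^2 \le C\kappa_n\|\beta_1 - \beta_1^*\|_2^2.$$

Since $\beta \in \mathcal{N}$, it follows from the assumption $\lambda_n^{-1}\kappa_n\gamma_n^2 = o(1)$ that

$$\max_{j>s}\Big|\frac{1}{n}\sum_{i=1}^{n}x_{ij}I_i\Big| \le C\kappa_n\gamma_n^2 = o(\lambda_n).$$

Plugging the above inequality and (A.46) into (A.45) completes the proof of (A.41).

Next we prove (A.42). By Hoeffding's inequality, if $\lambda_n > 2\sqrt{(1 + c)(\log p)/n}$, where $c$ is some positive constant, then

$$P\big(\|Q^T\rho'_\tau(\varepsilon)\|_\infty \ge n\lambda_n\big) \le \sum_{j>s}2\exp\Big(-\frac{n^2\lambda_n^2}{4\sum_{i=1}^{n}x_{ij}^2}\Big) = 2\exp\big(\log(p - s) - n\lambda_n^2/4\big) \le O(p^{-c}).$$

Thus, with probability at least $1 - O(p^{-c})$, (A.42) holds.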

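We now apply Corollary 14.4 in Bühlmann and van de Geer (2011) to prove (A.43). To this end, we need to check conditions of the Corollary. For each fixed $j$, define the functional space $\Gamma_j = \{\gamma_{\beta,j} : \beta \in \mathcal{N}\}$. First note that $E[\gamma_{\beta,j}(x_i, y_i)] = 0$ for any $\gamma_{\beta,j} \in \Gamma_j$. Second, since the $\rho'_\tau$ function is bounded, we have

$$\frac{1}{n}\sum_{i=1}^{n}\gamma_{\beta,j}^2(x_i, y_i) = \frac{1}{n}\sum_{i=1}^{n}x_{ij}^2\Big(\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i) - E[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)]\Big)^2 \le 4.$$

Thus, $\|\gamma_{\beta,j}\|_n \equiv \big(n^{-1}\sum_{i=1}^{n}\gamma_{\beta,j}^2(x_i, y_i)\big)^{1/2} \le 2$.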

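Third, we will calculate the covering number of the functional space $\Gamma_j$, $N(\cdot, \Gamma_j, \|\cdot\|_2)$. For any $\beta = (\beta_1^T, \beta_2^T)^T \in \mathcal{N}$ and $\tilde\beta = (\tilde\beta_1^T, \tilde\beta_2^T)^T \in \mathcal{N}$, by Condition 1 and the mean value theorem,

$$E[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)] - E[\rho'_\tau(y_i - x_i^T\tilde\beta) - \rho'_\tau(\varepsilon_i)] = F_i(S_i^T(\tilde\beta_1 - \beta_1^*)) - F_i(S_i^T(\beta_1 - \beta_1^*)) = f_i(a_{1i})S_i^T(\tilde\beta_1 - \beta_1), \tag{A.47}$$

where $F_i(t)$ is the cumulative distribution function of $\varepsilon_i$, and $a_{1i}$ lies on the segment connecting $S_i^T(\tilde\beta_1 - \beta_1^*)$ and $S_i^T(\beta_1 - \beta_1^*)$. Let $\kappa_n = \max_{ij}|x_{ij}|$. Since the $f_i(u)$'s are uniformly bounded, by (A.47),

$$\big|x_{ij}E[\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)] - x_{ij}E[\rho'_\tau(y_i - x_i^T\tilde\beta) - \rho'_\tau(\varepsilon_i)]\big| \le C|x_{ij}S_i^T(\beta_1 - \tilde\beta_1)| \le C\|x_{ij}S_i\|_2\|\beta_1 - \tilde\beta_1\|_2 \le C\sqrt{s}\,\kappa_n^2\|\beta_1 - \tilde\beta_1\|_2, \tag{A.48}$$

where $C > 0$ is some generic constant. It is known (see, for example, Lemma 2.5 in Bühlmann and van de Geer (2011)) that the ball $\mathcal{N}$ in $R^s$ can be covered by $(1 + 4\gamma_n/\delta)^s$ balls with radius $\delta$. Since $\rho'_\tau(y_i - x_i^T\beta) - \rho'_\tau(\varepsilon_i)$ can only take 3 different values $\{-1, 0, 1\}$, it follows from (A.48) that the covering number of $\Gamma_j$ is $N(2^{2-k}, \Gamma_j, \|\cdot\|_2) = 3(1 + C^{-1}2^k\gamma_ns^{1/2}\kappa_n^2)^s$. Thus, by calculus, for any $0 \le k \le (\log_2 n)/2$,

$$\log\big(1 + N(2^{2-k}, \Gamma_j, \|\cdot\|_2)\big) \le \log(6) + s\log\big(1 + C^{-1}2^k\gamma_ns^{1/2}\kappa_n^2\big) \le \log(6) + C^{-1}2^k\gamma_ns^{3/2}\kappa_n^2 \le 4\big(1 + C^{-1}\gamma_ns^{3/2}\kappa_n^2\big)2^{2k}.$$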

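Hence, conditions of Corollary 14.4 in Bühlmann and van de Geer (2011) are checked and we obtain that for any $t > 0$,

$$P\Big(\sup_{\beta\in\mathcal{N}}\Big|\frac{1}{n}\sum_{i=1}^{n}\gamma_{\beta,j}(x_i, y_i)\Big| \ge \frac{8}{\sqrt{n}}\Big(3\sqrt{1 + C^{-1}\gamma_ns^{3/2}\kappa_n^2}\,\log_2 n + 4 + 4t\Big)\Big) \le 4\exp\Big(-\frac{nt^2}{8}\Big).$$

Taking $t = \sqrt{C(\log p)/n}$ with $C > 0$ a large enough constant we obtain that

$$P\Big(\max_{j>s}\sup_{\beta\in\mathcal{N}}\Big|\frac{1}{n}\sum_{i=1}^{n}\gamma_{\beta,j}(x_i, y_i)\Big| \ge \frac{24}{\sqrt{n}}\sqrt{1 + C^{-1}\gamma_ns^{3/2}\kappa_n^2}\,\log_2 n\Big) \le 4(p - s)\exp\Big(-\frac{C\log p}{8}\Big) \to 0.$$

Thus if $\sqrt{1 + \gamma_ns^{3/2}\kappa_n^2}\,\log_2 n = o(\sqrt{n}\lambda_n)$, then with probability at least $1 - o(p^{-c})$, (A.43) holds. This completes the proof of the Lemma.

□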

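Lemma 3. Assume conditions of Theorem 4 hold. Let $R_{n,i}(\theta) = \rho_\tau(\varepsilon_i - Z_{n,i}^T\theta) - \rho_\tau(\varepsilon_i) + \rho'_\tau(\varepsilon_i)Z_{n,i}^T\theta$ and $R_n(\theta) = \sum_{i=1}^{n}R_{n,i}(\theta)$. Then for any $\epsilon > 0$,

$$P\big(|R_n(\theta) - E[R_n(\theta)]| \ge \epsilon\big) \le \exp\big(-C\epsilon b_ns^2(\log s)\big),$$

where $b_n$ is some diverging sequence such that $b_ns^{7/2}(\log s)\max_i\|Z_{n,i}\|_2 \to 0$, and $C > 0$ is some constant.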
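For intuition about the remainder term in Lemma 3, the following sketch checks numerically that, writing $v = Z_{n,i}^T\theta > 0$ and using the convention $\rho'_\tau(u) = \tau - 1\{u \le 0\}$, the quantity $\rho_\tau(\varepsilon - v) - \rho_\tau(\varepsilon) + \rho'_\tau(\varepsilon)v$ equals $(v - \varepsilon)1\{0 < \varepsilon \le v\}$, the closed form used in the proof. The helper names and the sampled values are illustrative assumptions, not part of the paper.

```python
import random

def rho(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u <= 0})."""
    return u * (tau - (1.0 if u <= 0 else 0.0))

def rho_prime(u, tau):
    """Derivative convention rho'_tau(u) = tau - 1{u <= 0}."""
    return tau - (1.0 if u <= 0 else 0.0)

def remainder(eps, v, tau):
    """R = rho(eps - v) - rho(eps) + rho'(eps) * v."""
    return rho(eps - v, tau) - rho(eps, tau) + rho_prime(eps, tau) * v

def closed_form(eps, v):
    """(v - eps) * 1{0 < eps <= v}, valid for v > 0."""
    return (v - eps) if 0 < eps <= v else 0.0

rng = random.Random(0)
tau = 0.25  # hypothetical quantile level
for _ in range(1000):
    eps = rng.uniform(-2, 2)
    v = rng.uniform(0.01, 2)
    assert abs(remainder(eps, v, tau) - closed_form(eps, v)) < 1e-12
```

The remainder is nonnegative and vanishes unless the error falls between 0 and $v$, which is why its moments are controlled by powers of $|Z_{n,i}^T\theta|$.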

Proof. Let $\xi_i = R_{n,i}(\theta) - E[R_{n,i}(\theta)]$. Then $R_n(\theta) - E[R_n(\theta)] = \sum_{i=1}^{n}\xi_i$. Since the $R_{n,i}(\theta)$'s are independent, by Markov's inequality we obtain that for any $\epsilon > 0$ and $t > 0$,

$$P\big(R_n(\theta) - E[R_n(\theta)] \ge \epsilon\big) \le e^{-t\epsilon}\prod_{i=1}^{n}E[\exp(t\xi_i)] = \exp\Big(-t\epsilon - t\sum_{i=1}^{n}E[R_{n,i}(\theta)]\Big)\prod_{i=1}^{n}E[\exp(tR_{n,i}(\theta))]. \tag{A.49}$$

We next study $E[R_{n,i}(\theta)]$ and $E[\exp(tR_{n,i}(\theta))]$ in (A.49). Using a similar argument to that for (A.10) we can prove that

$$E[R_{n,i}(\theta)] = E[\rho_\tau(\varepsilon_i - Z_{n,i}^T\theta) - \rho_\tau(\varepsilon_i)] = f_i(0)(Z_{n,i}^T\theta)^2 + O((Z_{n,i}^T\theta)^3),$$

where $O(\cdot)$ is uniformly over all $i$. Thus, it follows from the definition of $Z_{n,i}$ that

$$t\sum_{i=1}^{n}E[R_{n,i}(\theta)] = t\|\theta\|_2^2 + O\Big(t\sum_{i=1}^{n}(Z_{n,i}^T\theta)^3\Big). \tag{A.50}$$

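Now, we consider $E[\exp(tR_{n,i}(\theta))]$. If $Z_{n,i}^T\theta > 0$, then by definition $R_{n,i}(\theta) = (Z_{n,i}^T\theta - \varepsilon_i)1\{0 < \varepsilon_i \le Z_{n,i}^T\theta\}$. By Condition 1 and Taylor expansion it follows that

$$E[\exp(tR_{n,i}(\theta))] \le 1 + \big(\exp(tZ_{n,i}^T\theta) - 1\big)P(0 < \varepsilon_i \le Z_{n,i}^T\theta) \le 1 + f_i(0)t|Z_{n,i}^T\theta|^2 + O(t^2|Z_{n,i}^T\theta|^3).$$

When $Z_{n,i}^T\theta < 0$, we can get the same result using a similar argument. Since $\prod_{i=1}^{n}(1 + x_i) \le \exp(\sum_{i=1}^{n}x_i)$ for $x_i > 0$, in view of (A.11) and the above inequality we obtain that

$$\prod_{i=1}^{n}E[\exp(tR_{n,i}(\theta))] \le \exp\Big(\sum_{i=1}^{n}\big\{E[\exp(tR_{n,i}(\theta))] - 1\big\}\Big) \le \exp\Big(t\|\theta\|_2^2 + O\Big(t^2\sum_{i=1}^{n}|Z_{n,i}^T\theta|^3\Big)\Big). \tag{A.51}$$

Substituting (A.50) and (A.51) into (A.49) gives

$$P\big(R_n(\theta) - E[R_n(\theta)] \ge \epsilon\big) \le \exp\Big(-t\epsilon + O\Big(t^2\sum_{i=1}^{n}|Z_{n,i}^T\theta|^3\Big)\Big). \tag{A.52}$$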

Choosing $t = 2s^2(\log s)b_n$ with $b_n \to \infty$ such that $b_ns^{7/2}(\log s)\max_i\|Z_{n,i}\|_2 \to 0$, and using a similar idea to that for (A.11) we obtain that

$$t\sum_{i=1}^{n}|Z_{n,i}^T\theta|^3 \le Ct\max_i|Z_{n,i}^T\theta|\sum_{i=1}^{n}f_i(0)|Z_{n,i}^T\theta|^2 \le Cts^{3/2}\max_i\|Z_{n,i}\|_2 \to 0.$$

Plugging this into (A.52) yields that

$$P\big(R_n(\theta) - E[R_n(\theta)] \ge \epsilon\big) \le \exp\big(-C\epsilon b_ns^2(\log s)\big).$$

Repeating the same argument for $P\big(R_n(\theta) - E[R_n(\theta)] \le -\epsilon\big)$ completes the proof. □

Lemma 4. Let $\lambda(\theta)$ be a positive function defined on a convex, open subset $\Theta_s = \{\theta \in R^s : \|\theta\|_2 < c_6\sqrt{s}\}$ of $R^s$, and $\{\lambda_n(\theta) : \theta \in \Theta_s\}$ be a sequence of random convex functions defined on $\Theta_s$, where $c_6 > 0$ is some constant. Suppose that there exists some $b_n \to \infty$ such that for every $\theta \in \Theta_s$, the following holds for all $\epsilon > 0$:

$$P\big(|\lambda_n(\theta) - \lambda(\theta)| \ge \epsilon\big) \le c_9\exp\big(-c_7s^2(\log s)b_n\epsilon\big),$$

where $c_7, c_9$ are two positive constants. Let $K_s$ be a compact set in $R^s$ such that $K_s = \{\|\theta\|_2 \le c_4\sqrt{s}\} \subset \Theta_s$, where $c_4 < c_6$ is some positive constant. If, for some constant $c_8 > 0$, $|\lambda(\theta_1) - \lambda(\theta_2)| \le c_8s\|\theta_1 - \theta_2\|_\infty$ for any $\theta_1, \theta_2 \in \Theta_s$, then

$$\sup_{\theta\in K_s}|\lambda_n(\theta) - \lambda(\theta)| = o_p(1).$$

Proof. The proof is an extension of the convexity lemma in Pollard (1990). The basic idea is to prove that $K_s$ can be covered by a number of cubes, and $\lambda_n(\theta)$ and $\lambda(\theta)$ are uniformly close over the set of vertices of these cubes. Within each cube, values of both $\lambda_n(\theta)$ and $\lambda(\theta)$ do not change much. Thus $\lambda_n(\theta)$ and $\lambda(\theta)$ are uniformly close over $K_s$. In this proof we use $C$ to denote some generic positive constant, which may depend on the fixed $\epsilon$.

We proceed to prove the lemma. Since $|\lambda(\theta_1) - \lambda(\theta_2)| \le c_8s\|\theta_1 - \theta_2\|_\infty$ for any $\theta_1, \theta_2 \in \Theta_s$, it follows that for a fixed $\epsilon > 0$, the function $\lambda(\theta)$ varies by less than $\epsilon/s$ over each cube of side $\delta = \epsilon/(s^2c_8)$ that intersects $K_s$. Note that $K_s$ can be covered by less than $(2c_4\sqrt{s}/\delta)^s = (2c_4c_8s^{5/2}/\epsilon)^s$ such cubes. Then in total, there are less than $2^s(2c_4c_8s^{5/2}/\epsilon)^s$ vertices. Denote by $\mathcal{V}_s$ the set of all such vertices whose cubes intersect $K_s$. Since $c_6$ can be much larger than $c_4$ and the edge of each cube, $\delta = \epsilon/(c_8s^2)$, is small and decreases with $s$, all vertices in $\mathcal{V}_s$ fall in $\Theta_s$ as well. Thus by the pointwise convergence assumption in the Lemma, it is easy to derive that for any $\epsilon > 0$, as $b_n \to \infty$,

$$P\Big(\max_{\theta\in\mathcal{V}_s}|\lambda_n(\theta) - \lambda(\theta)| \ge \epsilon/s\Big) \le c_9\exp\big(Cs\log(Cs) - c_7s(\log s)b_n\epsilon\big) \to 0.$$

Therefore,

$$M_n = \max_{\theta\in\mathcal{V}_s}|\lambda_n(\theta) - \lambda(\theta)| = o_p(\epsilon/s). \tag{A.53}$$

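For any $\theta \in K_s$, it will fall into a cube and thus can be written as the convex combination of this cube's vertices $\{\theta_i\}$ in $\mathcal{V}_s$, that is, $\theta = \sum_i\alpha_i\theta_i$ with $\alpha_i \in [0, 1]$ and $\sum_i\alpha_i = 1$. Then by the convexity of $\lambda_n(\theta)$ and (A.53),

$$\lambda_n(\theta) \le \sum_i\alpha_i\lambda_n(\theta_i) \le \sum_i\alpha_i\big\{|\lambda_n(\theta_i) - \lambda(\theta_i)| + |\lambda(\theta_i) - \lambda(\theta)| + \lambda(\theta)\big\}$$

$$\le \max_i|\lambda_n(\theta_i) - \lambda(\theta_i)| + \epsilon/s + \lambda(\theta) \le M_n + \epsilon/s + \lambda(\theta).$$

Thus,

$$P\Big(\sup_{\theta\in K_s}\big(\lambda_n(\theta) - \lambda(\theta)\big) \ge 2\epsilon/s\Big) \to 0. \tag{A.54}$$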

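Next we prove the lower bound. Each $\theta$ in $K_s$ lies within a $\delta$-cube with a vertex $\theta_0$ in $\mathcal{V}_s$, thus

$$\theta = \theta_0 + \sum_{i=1}^{s}\delta_ie_i \quad\text{with } |\delta_i| < \delta,$$

where $e_1, \dots, e_s$ denote the $s$ coordinate directions. Without loss of generality suppose $0 \le \delta_i < \delta$ for each $i$. Define $\theta_i$ to be the vertex $\theta_0 - \delta e_i$ in $\mathcal{V}_s$. Then $\theta_0$ can be written as a convex combination of $\theta$ and the $\theta_i$, namely $\theta_0 = a\theta + \sum_ia_i\theta_i$.

Here $a = \delta/(\delta + \sum_i\delta_i)$ and $a_i = \delta_i/(\delta + \sum_i\delta_i)$. Since $0 \le \delta_i < \delta$, it follows that the coefficient for $\theta$ can be bounded as $a \ge 1/(1 + s)$. Since $\lambda_n(\cdot)$ is convex, by (A.53) we obtain that

$$a\lambda_n(\theta) \ge \lambda_n(\theta_0) - \sum_ia_i\lambda_n(\theta_i) \ge \lambda(\theta_0) - \sum_ia_i\lambda(\theta_i) - 2M_n$$

$$\ge \lambda(\theta) - \frac{\epsilon}{s} - \sum_ia_i\Big(\lambda(\theta) + \frac{\epsilon}{s}\Big) - 2M_n \ge a\lambda(\theta) - \frac{2\epsilon}{s} - 2M_n.$$

Since $a \ge (1 + s)^{-1}$, it follows from the above inequality and (A.53) that

$$\lambda_n(\theta) - \lambda(\theta) \ge \Big(-\frac{2\epsilon}{s} - 2M_n\Big)(s + 1) \ge -\frac{2\epsilon(s + 1)}{s} - o_p(1).$$

Therefore,

$$P\Big(\inf_{\theta\in K_s}\big(\lambda_n(\theta) - \lambda(\theta)\big) \le -3(1 + s)\frac{\epsilon}{s}\Big) \to 0. \tag{A.55}$$

Since we can choose $\epsilon$ arbitrarily small, the uniform convergence result follows easily by combining (A.54) with (A.55). □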
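The convex-combination identity behind the lower bound can be verified directly. The sketch below is an illustration under assumed values: with $\theta = \theta_0 + \sum_i\delta_ie_i$, $\theta_i = \theta_0 - \delta e_i$, $a = \delta/(\delta + \sum_i\delta_i)$ and $a_i = \delta_i/(\delta + \sum_i\delta_i)$, it confirms that $a\theta + \sum_ia_i\theta_i$ reproduces $\theta_0$ and that $a \ge 1/(1+s)$. The vertex, offsets, and cube side are hypothetical.

```python
def convex_weights(deltas, delta):
    """Weights a (for theta) and a_i (for theta_i = theta_0 - delta * e_i)."""
    total = delta + sum(deltas)
    return delta / total, [d / total for d in deltas]

def recombine(theta0, deltas, delta):
    """Return a and the vector a*theta + sum_i a_i*theta_i."""
    s = len(theta0)
    theta = [theta0[k] + deltas[k] for k in range(s)]
    a, a_i = convex_weights(deltas, delta)
    out = [a * theta[k] for k in range(s)]
    for i in range(s):
        for k in range(s):
            # k-th coordinate of the neighbouring vertex theta_i
            theta_i_k = theta0[k] - (delta if k == i else 0.0)
            out[k] += a_i[i] * theta_i_k
    return a, out

theta0 = [0.2, -0.1, 0.4]   # hypothetical cube vertex
deltas = [0.01, 0.03, 0.0]  # offsets with 0 <= delta_i < delta
delta = 0.05                # cube side
a, rebuilt = recombine(theta0, deltas, delta)
assert all(abs(r - t) < 1e-12 for r, t in zip(rebuilt, theta0))
assert a >= 1.0 / (1 + len(theta0))
```

Because every $\delta_i$ is below $\delta$, the denominator $\delta + \sum_i\delta_i$ is less than $(1+s)\delta$, which is where the bound on $a$ comes from.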